Preface of the First Author


This book was written on the basis of a graduate course on mathematical statistics 
given at the mathematical faculty of the Humboldt-University Berlin. 

The classical theory of parametric estimation, since the seminal works by Fisher, 
Wald, and Le Cam, among many others, has now reached maturity and an elegant 
form. It can be considered as more or less complete, at least for the so-called regular 
case. The question of the optimality and efficiency of the classical methods has been 
rigorously studied and typical results state the asymptotic normality and efficiency 
of the maximum likelihood and/or Bayes estimates; see the excellent monograph by
Ibragimov and Khas'minskij (1981) for a comprehensive study.

Around 1984, when I started my own Ph.D. at the Lomonosov
University, a popular joke in our statistical community in Moscow was that all
the problems in parametric statistical theory had been solved and described
in a complete way in Ibragimov and Khas'minskij (1981), so that there was nothing
left to do for mathematical statisticians. If at all, only a few nonparametric problems
remained open. After finishing my Ph.D. I also moved to nonparametric statistics for
a while, with a focus on local adaptive estimation. In 2005 I started to
write a monograph on nonparametric estimation using local parametric methods,
which was supposed to systematize my previous experience in this area. The very first
draft of this book was already available in autumn 2005, and it included only
a few sections on the basics of parametric estimation. However, attempts to prepare
a more systematic and more general presentation of the nonparametric theory led
me back to the very basic parametric concepts. In 2007 I significantly extended the
part about parametric methods. In spring 2009 I taught a graduate course on
parametric statistics at the mathematical faculty of the Humboldt-University Berlin.
My intention was to present a “modern” version of the classical theory which in 
particular addresses the following questions: 


- what do you need to know from parametric statistics to work on modern parametric and
nonparametric methods?
- how to identify the borderline between the classical parametric and the modern
nonparametric statistics?


The basic assumptions of the classical parametric theory are that the parametric 
specification is exact and the sample size is large relative to the dimension of the 
parameter space. Unfortunately, this viewpoint limits applicability of the classical 
theory: it is usually unrealistic to assume that the parametric specification is fulfilled 
exactly. So, the modern version of the parametric theory has to include a possible 
model misspecification. The issue of large samples is even more critical. Many 
modern applications face a situation when the number of parameters p is not 
only comparable with the sample size n, it can be even much larger than n. It 
is probably the main challenge of the modern parametric theory to include in a 
rigorous way the case of “large p small n.” One can say that the parametric theory 
that is able to systematically treat the issues of model misspecification and of small 
fixed samples already includes the nonparametric statistics. The present study aims 
at reconsidering the basics of the parametric theory in this sense. The “modern 
parametric” view can be stressed as follows: 


- any model is parametric; 
- any parametric model is wrong; 
- even a wrong model can be useful. 


The model mentioned in the first item can be understood as a set of assumptions 
describing the unknown distribution of the underlying data. This description is 
usually given in terms of some parameters. The parameter space can be large or 
infinite dimensional, however, the model is uniquely specified by the parameter 
value. In this sense “any model is parametric.” 

The second statement “any parametric model is wrong” means that any imaginary
model is only an idealization (approximation) of reality. It is unrealistic to
assume that the data exactly follow the parametric model, even if this model is
flexible and involves a lot of parameters. Model misspecification naturally leads
to the notion of the modeling bias measuring the distance between the underlying
model and the selected parametric family. It also separates the parametric and
nonparametric viewpoints. The parametric approach focuses on “estimation within
the model” ignoring the modeling bias. The nonparametric approach attempts to
account for the modeling bias and to optimize the joint impact of two kinds of errors: 
estimation error within the model and the modeling bias. This volume is limited to 
parametric estimation and testing for some special models like exponential families 
or linear models. However, it prepares some important tools for doing the general 
parametric theory presented in the second volume. 

The last statement “even a wrong model can be useful” introduces the notion of a
“useful” parametric specification. In some sense it indicates a change of paradigm
in parametric statistics. Trying to find the true model is hopeless anyway. Instead,
one aims at taking a potentially wrong parametric model which, however, possesses
some useful properties. Among others, one can point out the following “useful”
features:



- a nice geometric structure of the likelihood leading to a numerically efficient estimation
procedure;
- parameter identifiability. 


Lack of identifiability in the considered model is just an indication that 
the considered parametric model is poorly selected. A proper parametrization 
should involve a reasonable regularization ensuring both features: numerical 
efficiency/stability and a proper parameter identification. The present volume 
presents some examples of “useful models” like linear or exponential families. The 
second volume will extend such models to a quite general regular case involving 
some smoothness and moment conditions on the log-likelihood process of the 
considered parametric family. 

This book does not pretend to systematically cover the scope of the classical 
parametric theory. Some very important and even fundamental issues are not 
considered at all in this book. One characteristic example is given by the notion of
sufficiency, which can hardly be combined with model misspecification. At the same
time, much more attention is paid to the questions of nonasymptotic inference under
model misspecification, including concentration and confidence sets depending
on the sample size and dimensionality of the parameter space. In the first volume
we especially focus on linear models. This can be explained by their role for the 
general theory in which a linear model naturally arises from local approximation of 
a general regular model. 

This volume can be used as a textbook for a graduate course in mathematical
statistics. It assumes that the reader is familiar with the basic notions of
probability theory including the Lebesgue measure, the Radon–Nikodym derivative,
etc. Knowledge of basic statistics is not required. I tried to be as self-contained as
possible; most of the presented results are proved in a rigorous way. Sometimes
the details are left to the reader as exercises; in those cases some hints are given.


Preface of the Second Author 


It was in early 2012 when Prof. Spokoiny approached me with the idea of a joint 
lecture on Mathematical Statistics at Humboldt-University Berlin, where I was a 
junior professor at that time. Up to then, my own education in statistical inference
had been based on the German textbooks by Witting (1985) and Witting and
Müller-Funk (1995), and for teaching in English I had always used the books by Lehmann
and Casella (1998), Lehmann and Romano (2005), and Lehmann (1999). However,
I was aware of Prof. Spokoiny’s own textbook project and so the question was which 
text to use as the basis for the lecture. Finally, the first part of the lecture (estimation 
theory) was given by Prof. Spokoiny based on the by then already substantiated 
Chaps. 1-5 of the present work, while I gave the second part on test theory based 
on my own teaching material which was mainly based on Lehmann and Romano 
(2005). 
This joint teaching activity turned out to be the starting point of a collaboration
between Prof. Spokoiny and myself, and I was invited to join him as a coauthor 
of the present work for the Chaps. 6-8 on test theory, matching my own research 
interests. By the summer term of 2013, the book manuscript had substantially been 
extended, and I used it as the sole basis for the Mathematical Statistics lecture. 
During the course of this 2013 lecture, I received many constructive comments 
and suggestions from students and teaching assistants, which led to a further 
improvement of the text. 


Berlin, Germany Vladimir Spokoiny 
Berlin, Germany Thorsten Dickhaus 
Chapter 1
Basic Notions 


The starting point of any statistical analysis is data, also called observations or
a sample. A statistical model is used to explain the nature of the data. A standard
approach assumes that the data are random and utilizes some probabilistic framework.
In contrast to probability theory, the distribution of the data is not known
precisely and the goal of the analysis is to draw inference about this unknown distribution.

The parametric approach assumes that the distribution of the data is known up
to the value of a parameter $\theta$ from some subset $\Theta$ of a finite-dimensional space
$\mathbb{R}^p$. In this case the statistical analysis is naturally reduced to the estimation of
the parameter $\theta$: as soon as $\theta$ is known, we know the whole distribution of the
data. Before introducing the general notion of a statistical model, we discuss some
popular examples.


1.1 Example of a Bernoulli Experiment 


Let $Y = (Y_1, \ldots, Y_n)^\top$ be a sequence of binary digits zero or one. We distinguish
between deterministic and random sequences. Deterministic sequences appear, e.g.,
from the binary representation of a real number, or from digitally coded images,
etc. Random binary sequences appear, e.g., from coin tosses, games, etc. In many
situations incomplete information can be treated as random data: the classification 
of healthy and sick patients, individual vote results, the bankruptcy of a firm or credit 
default, etc. 
Basic assumptions behind a Bernoulli experiment are: 


- the observed data $Y_i$ are independent and identically distributed;
- each $Y_i$ assumes the value one with probability $\theta \in [0, 1]$.


The parameter $\theta$ completely identifies the distribution of the data $Y$. Indeed, for
every $i \le n$ and $y \in \{0, 1\}$,

$$\mathbb{P}(Y_i = y) = \theta^{y} (1 - \theta)^{1-y},$$

and the independence of the $Y_i$'s implies for every sequence $y = (y_1, \ldots, y_n)$ that

$$\mathbb{P}(Y = y) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1-y_i}. \tag{1.1}$$

To indicate this fact, we write $\mathbb{P}_\theta$ in place of $\mathbb{P}$.
Equation (1.1) can be rewritten as

$$\mathbb{P}_\theta(Y = y) = \theta^{s_n} (1 - \theta)^{n - s_n},$$

where

$$s_n = \sum_{i=1}^{n} y_i.$$

The value $s_n$ is often interpreted as the number of successes in the sequence $y$.

Probability theory focuses on the probabilistic properties of the data $Y$ under
the given measure $\mathbb{P}_\theta$. The aim of the statistical analysis is to draw inference about the measure
$\mathbb{P}_\theta$ for an unknown $\theta$ based on the available data $Y$. Typical examples of statistical
problems are:


1. Estimate the parameter $\theta$, i.e. build a function $\tilde\theta$ of the data $Y$ into $[0, 1]$ which
approximates the unknown value $\theta$ as well as possible;

2. Build a confidence set for $\theta$, i.e. a random (data-based) set (usually an interval)
containing $\theta$ with a prescribed probability;

3. Test a simple hypothesis that $\theta$ coincides with a prescribed value $\theta_0$, e.g. $\theta_0 = 1/2$;

4. Test a composite hypothesis that $\theta$ belongs to a prescribed subset $\Theta_0$ of the
interval $[0, 1]$.


Usually any statistical method is based on a preliminary probabilistic analysis of
the model under the given $\theta$.


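Theorem 1.1.1. Let $Y$ be i.i.d. Bernoulli with the parameter $\theta$. Then the mean and
the variance of the sum $S_n = Y_1 + \ldots + Y_n$ satisfy

$$\mathbb{E}_\theta S_n = n\theta, \qquad \operatorname{Var}_\theta S_n = \mathbb{E}_\theta (S_n - \mathbb{E}_\theta S_n)^2 = n\theta(1 - \theta).$$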


Exercise 1.1.1. Prove this theorem. 


This result suggests that the empirical mean $\tilde\theta = S_n/n$ is a reasonable estimate of
$\theta$. Indeed, the result of the theorem implies

$$\mathbb{E}_\theta \tilde\theta = \theta, \qquad \mathbb{E}_\theta (\tilde\theta - \theta)^2 = \theta(1 - \theta)/n.$$

The first equation means that $\tilde\theta$ is an unbiased estimate of $\theta$, that is, $\mathbb{E}_\theta \tilde\theta = \theta$ for
all $\theta$. The second equation yields a kind of concentration (consistency) property of
$\tilde\theta$: with $n$ growing, the estimate $\tilde\theta$ concentrates in a small neighborhood of the point
$\theta$. By the Chebyshev inequality, for any $\delta > 0$,

$$\mathbb{P}_\theta(|\tilde\theta - \theta| > \delta) \le \theta(1 - \theta)/(n\delta^2).$$
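
The Chebyshev bound can be compared with the exact deviation probability, since $S_n$ follows a binomial law. The following is a minimal sketch; the function names and the values of `n`, `theta`, and `delta` are illustrative assumptions, not taken from the text.

```python
from math import comb


def exact_deviation_prob(n: int, theta: float, delta: float) -> float:
    """Exact P_theta(|S_n/n - theta| > delta) for S_n ~ Binomial(n, theta)."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - theta) > delta:
            total += comb(n, k) * theta**k * (1 - theta) ** (n - k)
    return total


def chebyshev_bound(n: int, theta: float, delta: float) -> float:
    """Upper bound theta(1 - theta) / (n delta^2) from the Chebyshev inequality."""
    return theta * (1 - theta) / (n * delta**2)


# Illustrative (assumed) values.
n, theta, delta = 100, 0.5, 0.1
exact = exact_deviation_prob(n, theta, delta)
bound = chebyshev_bound(n, theta, delta)
print(f"exact = {exact:.4f}, Chebyshev bound = {bound:.4f}")
```

In this setting the bound is valid but far from sharp, which is one way to see why a refinement is of interest.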


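This result is refined by the famous de Moivre–Laplace theorem.


Theorem 1.1.2. Let $Y$ be i.i.d. Bernoulli with the parameter $\theta$. Then for every
$k \le n$

$$\mathbb{P}_\theta(S_n = k) = \binom{n}{k} \theta^{k} (1 - \theta)^{n-k} \approx \frac{1}{\sqrt{2\pi n\theta(1 - \theta)}} \exp\left\{ -\frac{(k - n\theta)^2}{2n\theta(1 - \theta)} \right\},$$

where $a_n \approx b_n$ means $a_n/b_n \to 1$ as $n \to \infty$. Moreover, for any fixed $z > 0$,

$$\mathbb{P}_\theta\left( \left| \frac{S_n}{n} - \theta \right| > z\sqrt{\theta(1 - \theta)/n} \right) \approx \frac{2}{\sqrt{2\pi}} \int_z^\infty e^{-t^2/2}\, dt.$$


This concentration result yields that the estimate $\tilde\theta$ deviates from the root-n neighborhood
$A(z, \theta) \stackrel{\mathrm{def}}{=} \{u : |\theta - u| \le z\sqrt{\theta(1 - \theta)/n}\}$ with probability of order $e^{-z^2/2}$.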

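This result bounding the difference $|\tilde\theta - \theta|$ can also be used to build random
confidence intervals around the point $\tilde\theta$. Indeed, by the result of the theorem, the
random interval $E^*(z) = \{u : |\tilde\theta - u| \le z\sqrt{\theta(1 - \theta)/n}\}$ fails to cover the true point
$\theta$ with approximately the same probability:

$$\mathbb{P}_\theta\big(E^*(z) \not\ni \theta\big) \approx \frac{2}{\sqrt{2\pi}} \int_z^\infty e^{-t^2/2}\, dt. \tag{1.2}$$

Unfortunately, the construction of this interval $E^*(z)$ is not entirely data-based.
Its width involves the true unknown value $\theta$. A data-based confidence set can be
obtained by replacing the population variance $\sigma^2 \stackrel{\mathrm{def}}{=} \mathbb{E}_\theta (Y_1 - \theta)^2 = \theta(1 - \theta)$ with
its empirical counterpart

$$\tilde\sigma^2 \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} (Y_i - \tilde\theta)^2.$$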



The resulting confidence set $E(z)$ reads as

$$E(z) \stackrel{\mathrm{def}}{=} \big\{u : |\tilde\theta - u| \le z\sqrt{n^{-1}\tilde\sigma^2}\big\}.$$

It possesses the same asymptotic properties as $E^*(z)$ including (1.2).

The hypothesis that the value $\theta$ is equal to a prescribed value $\theta_0$, e.g. $\theta_0 = 1/2$,
can be checked by examining the difference $|\tilde\theta - 1/2|$. If this value is too large
compared to $\sigma n^{-1/2}$ or to $\tilde\sigma n^{-1/2}$, then the hypothesis is wrong with high
probability. Similarly one can consider a composite hypothesis that $\theta$ belongs to
some interval $[\theta_1, \theta_2] \subset [0, 1]$. If $\tilde\theta$ deviates from this interval at least by the value
$z\tilde\sigma n^{-1/2}$ with a large $z$, then the data significantly contradict this hypothesis.
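
The data-based interval can be tried out in a small simulation. The sketch below assumes illustrative values for the true parameter, the sample size, and the number of repetitions, and takes $z$ as the standard normal quantile for which the right-hand side of (1.2) is about 0.05; the helper name `bernoulli_confidence_interval` is hypothetical.

```python
import random
from statistics import NormalDist


def bernoulli_confidence_interval(y, z):
    """Return (theta_tilde, lower, upper) for the data-based interval E(z)."""
    n = len(y)
    theta_tilde = sum(y) / n
    # Empirical variance (1/n) * sum (Y_i - theta_tilde)^2.
    sigma2_tilde = sum((yi - theta_tilde) ** 2 for yi in y) / n
    half_width = z * (sigma2_tilde / n) ** 0.5
    return theta_tilde, theta_tilde - half_width, theta_tilde + half_width


rng = random.Random(0)
theta_true, n = 0.3, 400          # assumed values for illustration
z = NormalDist().inv_cdf(0.975)   # 2 * (1 - Phi(z)) = 0.05

# Monte Carlo estimate of the coverage probability of E(z).
n_rep = 2000
covered = 0
for _ in range(n_rep):
    y = [1 if rng.random() < theta_true else 0 for _ in range(n)]
    _, lo, hi = bernoulli_confidence_interval(y, z)
    covered += lo <= theta_true <= hi
coverage = covered / n_rep
print(f"empirical coverage: {coverage:.3f}")
```

The same interval gives a simple check of the hypothesis $\theta = \theta_0$: the data contradict it at this level when $\theta_0$ falls outside $E(z)$.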


1.2 Least Squares Estimation in a Linear Model 


A linear model assumes a linear systematic dependence of the output (also
called response or explained variable) $Y$ on the input (also called regressor
or explanatory variable) $\Psi$ which in general can be multidimensional. The linear
model is usually written in the form

$$\mathbb{E}(Y) = \Psi^\top \theta^*$$

with an unknown vector of coefficients $\theta^* = (\theta_1^*, \ldots, \theta_p^*)^\top$. Equivalently one
writes

$$Y = \Psi^\top \theta^* + \varepsilon \tag{1.3}$$

where $\varepsilon$ stands for the individual error with zero mean: $\mathbb{E}\varepsilon = 0$. Such a linear model
is often used to describe the influence of the regressor $\Psi$ on the response $Y$ from the
collection of data in the form of a sample $(Y_i, \Psi_i)$ for $i = 1, \ldots, n$.

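Let $\theta$ be a vector of coefficients considered as a candidate for $\theta^*$. Then
each observation $Y_i$ is approximated by $\Psi_i^\top \theta$. One often measures the quality
of approximation by the sum of quadratic errors $|Y_i - \Psi_i^\top \theta|^2$. Under the model
assumption (1.3), the expected value of this sum is

$$\mathbb{E} \sum |Y_i - \Psi_i^\top \theta|^2 = \mathbb{E} \sum |\Psi_i^\top (\theta^* - \theta) + \varepsilon_i|^2 = \sum |\Psi_i^\top (\theta^* - \theta)|^2 + \sum \mathbb{E}\varepsilon_i^2.$$

The cross term cancels in view of $\mathbb{E}\varepsilon_i = 0$. Note that minimizing this expression
w.r.t. $\theta$ is equivalent to minimizing the first sum because the second sum does not
depend on $\theta$. Therefore,

$$\operatorname*{argmin}_{\theta} \mathbb{E} \sum |Y_i - \Psi_i^\top \theta|^2 = \operatorname*{argmin}_{\theta} \sum |\Psi_i^\top (\theta^* - \theta)|^2 = \theta^*.$$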
In other words, the true parameter vector $\theta^*$ minimizes the expected quadratic error
of fitting the data with a linear combination of the $\Psi_i$'s. The least squares estimate
of the parameter vector $\theta^*$ is defined by minimizing in $\theta$ its empirical counterpart,
that is, the sum of the squared errors $|Y_i - \Psi_i^\top \theta|^2$ over all $i$:

$$\tilde\theta \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{\theta} \sum_{i=1}^{n} |Y_i - \Psi_i^\top \theta|^2.$$

This minimization problem can be solved explicitly under some condition on the $\Psi_i$'s. Define the
$p \times n$ design matrix $\Psi = (\Psi_1, \ldots, \Psi_n)$. The aforementioned condition means that
this matrix is of rank $p$.


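Theorem 1.2.1. Let $Y_i = \Psi_i^\top \theta^* + \varepsilon_i$ for $i = 1, \ldots, n$, where $\varepsilon_i$ are independent
and satisfy $\mathbb{E}\varepsilon_i = 0$, $\mathbb{E}\varepsilon_i^2 = \sigma^2$. Suppose that the matrix $\Psi$ is of rank $p$. Then

$$\tilde\theta = (\Psi\Psi^\top)^{-1} \Psi Y,$$

where $Y = (Y_1, \ldots, Y_n)^\top$. Moreover, $\tilde\theta$ is unbiased in the sense that

$$\mathbb{E}_{\theta^*} \tilde\theta = \theta^*$$

and its variance satisfies $\operatorname{Var}(\tilde\theta) = \sigma^2 (\Psi\Psi^\top)^{-1}$.
For each vector $h \in \mathbb{R}^p$, the random value $\tilde a = \langle h, \tilde\theta \rangle = h^\top \tilde\theta$ is an unbiased
estimate of $a^* = h^\top \theta^*$:

$$\mathbb{E}_{\theta^*}(\tilde a) = a^* \tag{1.4}$$

with the variance

$$\operatorname{Var}(\tilde a) = \sigma^2 h^\top (\Psi\Psi^\top)^{-1} h.$$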
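Proof. Define

$$Q(\theta) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n} |Y_i - \Psi_i^\top \theta|^2 = \|Y - \Psi^\top \theta\|^2,$$

where $\|y\|^2 \stackrel{\mathrm{def}}{=} \sum_i y_i^2$. The normal equation $dQ(\theta)/d\theta = 0$ can be written as
$\Psi\Psi^\top \theta = \Psi Y$ yielding the representation of $\tilde\theta$. Now the model equation yields
$\mathbb{E}_{\theta^*} Y = \Psi^\top \theta^*$ and thus

$$\mathbb{E}_{\theta^*} \tilde\theta = (\Psi\Psi^\top)^{-1} \Psi\, \mathbb{E}_{\theta^*} Y = (\Psi\Psi^\top)^{-1} \Psi\Psi^\top \theta^* = \theta^*$$

as required.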

Exercise 1.2.1. Check that $\operatorname{Var}(\tilde\theta) = \sigma^2 (\Psi\Psi^\top)^{-1}$.


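Similarly one obtains $\mathbb{E}_{\theta^*}(\tilde a) = \mathbb{E}_{\theta^*}(h^\top \tilde\theta) = h^\top \theta^* = a^*$, that is, $\tilde a$ is an unbiased
estimate of $a^*$. Also

$$\operatorname{Var}(\tilde a) = \operatorname{Var}(h^\top \tilde\theta) = h^\top \operatorname{Var}(\tilde\theta)\, h = \sigma^2 h^\top (\Psi\Psi^\top)^{-1} h,$$

which completes the proof.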
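
The closed form $\tilde\theta = (\Psi\Psi^\top)^{-1}\Psi Y$ can be checked numerically by solving the normal equation $\Psi\Psi^\top\theta = \Psi Y$ directly. The sketch below uses only the standard library; the helper names, the simulated design (an intercept and a slope), and the coefficient values are assumptions made for illustration.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            factor = m[r][col] / m[col][col]
            for c in range(col, p + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * p
    for r in reversed(range(p)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, p))
        x[r] = (m[r][p] - s) / m[r][r]
    return x


def least_squares(psi, y):
    """LSE for a p x n design matrix psi (list of p rows) and responses y."""
    p, n = len(psi), len(y)
    # Gram matrix Psi Psi^T (p x p) and right-hand side Psi Y (length p).
    gram = [[sum(psi[j][i] * psi[k][i] for i in range(n)) for k in range(p)]
            for j in range(p)]
    rhs = [sum(psi[j][i] * y[i] for i in range(n)) for j in range(p)]
    return solve_linear_system(gram, rhs)


rng = random.Random(1)
n = 200
x = [i / n for i in range(n)]
psi = [[1.0] * n, x]         # Psi_i = (1, x_i)^T
theta_star = [1.0, -2.0]     # hypothetical true coefficients
y = [theta_star[0] + theta_star[1] * xi + rng.gauss(0.0, 0.1) for xi in x]
theta_tilde = least_squares(psi, y)
print(theta_tilde)
```

With noise-free responses the routine recovers the coefficients up to rounding error, which mirrors the unbiasedness argument in the proof.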


The next result states that the proposed estimate $\tilde a$ is in some sense the best
possible one. Namely, we consider the class of all linear unbiased estimates $\hat a$
satisfying the identity (1.4). It appears that the variance $\sigma^2 h^\top (\Psi\Psi^\top)^{-1} h$ of $\tilde a$ is
the smallest possible in this class.


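Theorem 1.2.2 (Gauss–Markov). Let $Y_i = \Psi_i^\top \theta^* + \varepsilon_i$ for $i = 1, \ldots, n$ with
uncorrelated $\varepsilon_i$ satisfying $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{E}\varepsilon_i^2 = \sigma^2$. Let $\operatorname{rank}(\Psi) = p$. Suppose that
the value $a^* \stackrel{\mathrm{def}}{=} \langle h, \theta^* \rangle = h^\top \theta^*$ is to be estimated for a given vector $h \in \mathbb{R}^p$. Then
$\tilde a = \langle h, \tilde\theta \rangle = h^\top \tilde\theta$ is an unbiased estimate of $a^*$. Moreover, $\tilde a$ has the minimal
possible variance over the class of all linear unbiased estimates of $a^*$.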


This result was historically one of the first optimality results in statistics. It 
presents a lower efficiency bound of any statistical procedure. Under the imposed 
restrictions it is impossible to do better than the LSE does. This and more general 
results will be proved later in Chap. 4. 

Define also the vector of residuals

$$\hat\varepsilon \stackrel{\mathrm{def}}{=} Y - \Psi^\top \tilde\theta.$$

If $\tilde\theta$ is a good estimate of the vector $\theta^*$, then due to the model equation, $\hat\varepsilon$ is a good
estimate of the vector $\varepsilon$ of individual errors. Many statistical procedures utilize this
observation by checking the quality of estimation via the analysis of the estimated
vector $\hat\varepsilon$. In the case when this vector still shows a nonzero systematic component,
there is evidence that the assumed linear model is incorrect. This vector can also be
used to estimate the noise variance $\sigma^2$.


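Theorem 1.2.3. Consider the linear model $Y_i = \Psi_i^\top \theta^* + \varepsilon_i$ with independent
homogeneous errors $\varepsilon_i$. Then the variance $\sigma^2 = \mathbb{E}\varepsilon_i^2$ can be estimated by

$$\hat\sigma^2 = \frac{\|\hat\varepsilon\|^2}{n - p} = \frac{\|Y - \Psi^\top \tilde\theta\|^2}{n - p}$$

and $\hat\sigma^2$ is an unbiased estimate of $\sigma^2$, that is, $\mathbb{E}_{\theta^*} \hat\sigma^2 = \sigma^2$ for all $\theta^*$ and $\sigma$.


Theorems 1.2.2 and 1.2.3 can be used to describe the concentration properties of
the estimate $\tilde a$ and to build confidence sets based on $\tilde a$ and $\hat\sigma$, especially if the errors
$\varepsilon_i$ are normally distributed.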

Theorem 1.2.4. Let $Y_i = \Psi_i^\top \theta^* + \varepsilon_i$ for $i = 1, \ldots, n$ with $\varepsilon_i \sim N(0, \sigma^2)$. Let
$\operatorname{rank}(\Psi) = p$. Then it holds for the estimate $\tilde a = h^\top \tilde\theta$ of $a^* = h^\top \theta^*$

$$\tilde a - a^* \sim N(0, s^2)$$

with $s^2 = \sigma^2 h^\top (\Psi\Psi^\top)^{-1} h$.


Corollary 1.2.1 (Concentration). If for some $\alpha > 0$, $z_\alpha$ is the $(1 - \alpha/2)$-quantile of
the standard normal law (i.e., $\Phi(z_\alpha) = 1 - \alpha/2$), then

$$\mathbb{P}_{\theta^*}\big(|\tilde a - a^*| > z_\alpha s\big) = \alpha.$$


Exercise 1.2.2. Check Corollary 1.2.1. 


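The next result describes the confidence set for $a^*$. The unknown variance $s^2$ is
replaced by its estimate

$$\hat s^2 \stackrel{\mathrm{def}}{=} \hat\sigma^2 h^\top (\Psi\Psi^\top)^{-1} h.$$

Corollary 1.2.2 (Confidence Set). If $E(z_\alpha) \stackrel{\mathrm{def}}{=} \{a : |\tilde a - a| \le \hat s z_\alpha\}$, then

$$\mathbb{P}_{\theta^*}\big(E(z_\alpha) \not\ni a^*\big) \approx \alpha.$$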
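
As an illustration of Corollary 1.2.2, consider the simplest linear model with $p = 1$, $\Psi_i \equiv 1$, and $h = 1$, so that $\Psi\Psi^\top = n$, $\tilde a$ is the sample mean, and $\hat s^2 = \hat\sigma^2/n$. The sketch below estimates the non-coverage probability by simulation; the function name and all numerical values are assumptions chosen for the example.

```python
import random
from statistics import NormalDist


def mean_confidence_interval(y, alpha):
    """Interval {a : |a_tilde - a| <= s_hat * z_alpha} in the model Y_i = theta* + eps_i."""
    n, p = len(y), 1
    a_tilde = sum(y) / n                  # LSE, since Psi Psi^T = n and Psi Y = sum(y)
    sigma2_hat = sum((yi - a_tilde) ** 2 for yi in y) / (n - p)
    s_hat = (sigma2_hat / n) ** 0.5       # h^T (Psi Psi^T)^{-1} h = 1/n for h = 1
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return a_tilde - s_hat * z_alpha, a_tilde + s_hat * z_alpha


rng = random.Random(0)
theta_star, sigma, n, alpha = 2.0, 1.5, 100, 0.05   # hypothetical values
n_rep, missed = 2000, 0
for _ in range(n_rep):
    y = [theta_star + rng.gauss(0.0, sigma) for _ in range(n)]
    lo, hi = mean_confidence_interval(y, alpha)
    missed += not (lo <= theta_star <= hi)
miss_rate = missed / n_rep
print(f"empirical non-coverage: {miss_rate:.3f} (nominal {alpha})")
```

The empirical non-coverage is only approximately $\alpha$, in line with the "$\approx$" in the corollary: replacing $s$ by $\hat s$ introduces an additional estimation error.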


1.3 General Parametric Model


Let $Y$ denote the observed data with values in the observation space $\mathcal{Y}$. In most
cases, $Y \in \mathbb{R}^n$, that is, $Y = (Y_1, \ldots, Y_n)^\top$. Here $n$ denotes the sample size (number
of observations). The basic assumption about these data is that the vector $Y$ is a
random variable on a probability space $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}), \mathbb{P})$, where $\mathcal{B}(\mathcal{Y})$ is the Borel
$\sigma$-algebra on $\mathcal{Y}$. The probabilistic approach assumes that the probability measure $\mathbb{P}$
is known and studies the distributional (population) properties of the vector $Y$. On
the contrary, the statistical approach assumes that the data $Y$ are given and tries to
recover the distribution $\mathbb{P}$ on the basis of the available data $Y$. One can say that the
statistical problem is inverse to the probabilistic one.

The statistical analysis is usually based on the notion of statistical experiment.
This notion assumes that a family $\mathcal{P}$ of probability measures on $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$ is fixed
and the unknown underlying measure $\mathbb{P}$ belongs to this family. Often this family is
parameterized by the value $\theta$ from some parameter set $\Theta$: $\mathcal{P} = (\mathbb{P}_\theta, \theta \in \Theta)$. The
corresponding statistical experiment can be written as

$$\big(\mathcal{Y}, \mathcal{B}(\mathcal{Y}), (\mathbb{P}_\theta, \theta \in \Theta)\big).$$

The value $\theta^*$ denotes the “true” parameter value, that is, $\mathbb{P} = \mathbb{P}_{\theta^*}$.

The statistical experiment is dominated if there exists a dominating $\sigma$-finite
measure $\mu_0$ such that all the $\mathbb{P}_\theta$ are absolutely continuous w.r.t. $\mu_0$. In what
follows we assume without further mention that the considered statistical models
are dominated. Usually the choice of a dominating measure is unimportant and any
one can be used.

The parametric approach assumes that $\Theta$ is a subset of a finite-dimensional
Euclidean space $\mathbb{R}^p$. In this case, the unknown data distribution is specified by
the value of a finite-dimensional parameter $\theta$ from $\Theta \subseteq \mathbb{R}^p$. Since in this case
the parameter $\theta$ completely identifies the distribution of the observations $Y$, the
statistical estimation problem is reduced to recovering (estimating) this parameter
from the data. The nice feature of the parametric theory is that the estimation
problem can be solved in a rather general way.


1.4 Statistical Decision Problem. Loss and Risk


The statistical decision problem is usually formulated in terms of game theory, the
statistician playing as it were against nature. Let $\mathcal{D}$ denote the decision space that
is assumed to be a topological space. Next, let $\wp(\cdot, \cdot)$ be a loss function given on
the product $\mathcal{D} \times \Theta$. The value $\wp(d, \theta)$ denotes the loss associated with the decision
$d \in \mathcal{D}$ when the true parameter value is $\theta \in \Theta$. The statistical decision problem
is composed of a statistical experiment $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}), \mathcal{P})$, a decision space $\mathcal{D}$ and a loss
function $\wp(\cdot, \cdot)$.

A statistical decision $\rho = \rho(Y)$ is a measurable function of the observed data $Y$
with values in the decision space $\mathcal{D}$. Clearly, $\rho(Y)$ can be considered as a random
$\mathcal{D}$-valued element on the space $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$. The corresponding loss under the true model
$(\mathcal{Y}, \mathcal{B}(\mathcal{Y}), \mathbb{P}_{\theta^*})$ reads as $\wp(\rho(Y), \theta^*)$. Finally, the risk is defined as the expected
value of the loss:

$$\mathcal{R}(\rho, \theta^*) \stackrel{\mathrm{def}}{=} \mathbb{E}_{\theta^*} \wp(\rho(Y), \theta^*).$$

Below we present a list of typical statistical decision problems.
Below we present a list of typical statistical decision problems. 


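Example 1.4.1 (Point Estimation Problem). Let the target of analysis be the true
parameter $\theta^*$ itself, that is, $\mathcal{D}$ coincides with $\Theta$. Let $\wp(\cdot, \cdot)$ be a kind of distance
on $\Theta$, that is, $\wp(\theta, \theta^*)$ denotes the loss of estimation, when the selected value is $\theta$
while the true parameter is $\theta^*$. Typical examples of the loss function are the quadratic
loss $\wp(\theta, \theta^*) = \|\theta - \theta^*\|^2$, the $\ell_1$-loss $\wp(\theta, \theta^*) = \|\theta - \theta^*\|_1$, or the sup-loss $\wp(\theta, \theta^*) =
\|\theta - \theta^*\|_\infty = \max_{j=1,\ldots,p} |\theta_j - \theta_j^*|$.

If $\tilde\theta$ is an estimate of $\theta^*$, that is, $\tilde\theta$ is a $\Theta$-valued function of the data $Y$, then the
corresponding risk is

$$\mathcal{R}(\tilde\theta, \theta^*) = \mathbb{E}_{\theta^*} \wp(\tilde\theta, \theta^*).$$

Particularly, the quadratic risk reads as $\mathbb{E}_{\theta^*} \|\tilde\theta - \theta^*\|^2$.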
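
The three loss functions and the risk as an expected loss can be written down directly. The sketch below approximates the risk by a Monte Carlo average; the Bernoulli sampler, the sample size, and the parameter value are assumptions used only for illustration, and for this choice the quadratic risk is known from Section 1.1 to equal $\theta^*(1 - \theta^*)/n$.

```python
import random


def quadratic_loss(theta, theta_star):
    return sum((a - b) ** 2 for a, b in zip(theta, theta_star))


def l1_loss(theta, theta_star):
    return sum(abs(a - b) for a, b in zip(theta, theta_star))


def sup_loss(theta, theta_star):
    return max(abs(a - b) for a, b in zip(theta, theta_star))


def monte_carlo_risk(estimator, sampler, loss, theta_star, n_rep, rng):
    """Average loss of estimator(Y) over n_rep simulated data sets Y ~ P_{theta*}."""
    total = 0.0
    for _ in range(n_rep):
        total += loss(estimator(sampler(theta_star, rng)), theta_star)
    return total / n_rep


# Illustration: Bernoulli model with n = 50 and the empirical mean as estimate.
n = 50


def sampler(theta_star, rng):
    return [1 if rng.random() < theta_star[0] else 0 for _ in range(n)]


def empirical_mean(y):
    return [sum(y) / len(y)]


rng = random.Random(2)
theta_star = [0.3]
risk = monte_carlo_risk(empirical_mean, sampler, quadratic_loss, theta_star, 20000, rng)
print(f"Monte Carlo risk {risk:.5f}, theoretical {0.3 * 0.7 / n:.5f}")
```

Parameters are represented as lists so that the same loss functions apply to vector-valued $\theta$.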

Example 1.4.2 (Testing Problem). Let $\Theta_0$ and $\Theta_1$ be two complementary subsets
of $\Theta$, that is, $\Theta_0 \cap \Theta_1 = \emptyset$, $\Theta_0 \cup \Theta_1 = \Theta$. Our target is to check whether the true
parameter $\theta^*$ belongs to the subset $\Theta_0$. The decision space consists of two points
$\{0, 1\}$ for which $d = 0$ means the acceptance of the hypothesis $H_0 : \theta^* \in \Theta_0$ while
$d = 1$ rejects $H_0$ in favor of the alternative $H_1 : \theta^* \in \Theta_1$. Define the loss

$$\wp(d, \theta) = \mathbf{1}(d = 1, \theta \in \Theta_0) + \mathbf{1}(d = 0, \theta \in \Theta_1).$$

A test $\phi$ is a binary valued function of the data, $\phi = \phi(Y) \in \{0, 1\}$. The
corresponding risk $\mathcal{R}(\phi, \theta^*) = \mathbb{E}_{\theta^*} \wp(\phi(Y), \theta^*)$ can be interpreted as the probability of
selecting the wrong subset.


Example 1.4.3 (Confidence Estimation). Let the target of analysis again be the
parameter $\theta^*$. However, we aim to identify a subset $A$ of $\Theta$, as small as possible,
that covers with a prescribed probability the true value $\theta^*$. Our decision space $\mathcal{D}$
is now the set of all measurable subsets in $\Theta$. For any $A \in \mathcal{D}$, the loss function is
defined as $\wp(A, \theta^*) = \mathbf{1}(A \not\ni \theta^*)$. A confidence set is a random set $\mathcal{E}$ selected from
the data $Y$, $\mathcal{E} = \mathcal{E}(Y)$. The corresponding risk $\mathcal{R}(\mathcal{E}, \theta^*) = \mathbb{E}_{\theta^*} \wp(\mathcal{E}, \theta^*)$ is just the
probability that $\mathcal{E}$ does not cover $\theta^*$.


Example 1.4.4 (Estimation of a Functional). Let the target of estimation be a given
function $f(\theta^*)$ of the parameter $\theta^*$ with values in another space $F$. A typical
example is given by a single component of the vector $\theta^*$. An estimate $\rho$ of $f(\theta^*)$
is a function of the data $Y$ into $F$: $\rho = \rho(Y) \in F$. The loss function $\wp$ is
defined on the product $F \times F$, yielding the loss $\wp(\rho(Y), f(\theta^*))$ and the risk

$$\mathcal{R}(\rho(Y), f(\theta^*)) = \mathbb{E}_{\theta^*} \wp(\rho(Y), f(\theta^*)).$$


Exercise 1.4.1. Define the statistical decision problem for testing a simple hypothesis
$\theta^* = \theta_0$ for a given point $\theta_0$.


1.5 Efficiency 


After the statistical decision problem is stated, one can ask for its optimal solution. 
Equivalently one can say that the aim of statistical analysis is to build a decision 
with the minimal possible risk. However, a comparison of any two decisions on the 
basis of risk can be a nontrivial problem. Indeed, the risk $\mathcal{R}(\rho, \theta^*)$ of a decision $\rho$
depends on the true parameter value $\theta^*$. It may happen that one decision performs
better for some points $\theta^* \in \Theta$ but worse at other points $\theta^*$. An extreme example of
such an estimate is the trivial deterministic decision $\tilde\theta \equiv \theta_0$ which sets the estimate
equal to the value $\theta_0$ whatever the data is. This is, of course, a very strange and
poor estimate, but it clearly outperforms all other methods if the true parameter $\theta^*$
is indeed $\theta_0$.

Two approaches are typically used to compare different statistical decisions:
the minimax approach considers the maximum $\mathcal{R}(\rho)$ of the risks $\mathcal{R}(\rho, \theta)$ over the
parameter set $\Theta$ while the Bayes approach is based on the weighted sum (integral)
$\mathcal{R}_\pi(\rho)$ of such risks with respect to some measure $\pi$ on the parameter set $\Theta$ which
is called the prior distribution:

$$\mathcal{R}(\rho) = \sup_{\theta \in \Theta} \mathcal{R}(\rho, \theta), \qquad \mathcal{R}_\pi(\rho) = \int \mathcal{R}(\rho, \theta)\, \pi(d\theta).$$

The decision $\rho^*$ is called minimax if

$$\mathcal{R}(\rho^*) = \inf_{\rho} \mathcal{R}(\rho) = \inf_{\rho} \sup_{\theta \in \Theta} \mathcal{R}(\rho, \theta),$$

where the infimum is taken over the set of all possible decisions $\rho$. The value $\mathcal{R}^* =
\mathcal{R}(\rho^*)$ is called the minimax risk.
Similarly, the decision $\rho_\pi$ is called Bayes for the prior $\pi$ if

$$\mathcal{R}_\pi(\rho_\pi) = \inf_{\rho} \mathcal{R}_\pi(\rho).$$

The corresponding value $\mathcal{R}_\pi(\rho_\pi)$ is called the Bayes risk.


Exercise 1.5.1. Show that the minimax risk is greater than or equal to the Bayes
risk whatever the prior measure $\pi$ is.
Hint: show that for any decision $\rho$, it holds $\mathcal{R}(\rho) \ge \mathcal{R}_\pi(\rho)$.


Usually the problem of finding a minimax or Bayes estimate is quite hard and a 
closed form solution is available only in very few special cases. A standard way out 
of this problem is to switch to an asymptotic setup in which the sample size grows 
to infinity. 
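
The two ways of summarizing a risk function can be illustrated in the Bernoulli model of Section 1.1 under quadratic loss, where the risks are available in closed form: $\theta(1 - \theta)/n$ for the empirical mean and $(\theta - \theta_0)^2$ for the constant decision $\tilde\theta \equiv \theta_0$. The sketch below is a numerical illustration only; it restricts $\theta$ to a finite grid and takes a discrete uniform prior on that grid, both of which are simplifying assumptions.

```python
def risk_empirical_mean(theta, n):
    """Quadratic risk of S_n / n in the Bernoulli model."""
    return theta * (1 - theta) / n


def risk_constant(theta, theta_0):
    """Quadratic risk of the constant decision theta_0."""
    return (theta - theta_0) ** 2


def max_risk(risk, grid):
    """Maximum of the risk over the grid (a stand-in for the supremum over Theta)."""
    return max(risk(t) for t in grid)


def bayes_risk_uniform(risk, grid):
    """Risk averaged w.r.t. the discrete uniform prior on the grid."""
    return sum(risk(t) for t in grid) / len(grid)


n, theta_0 = 25, 0.5                      # assumed values
grid = [k / 100 for k in range(101)]      # grid on [0, 1] including both endpoints

decisions = [
    ("empirical mean", lambda t: risk_empirical_mean(t, n)),
    ("constant 1/2", lambda t: risk_constant(t, theta_0)),
]
summary = {}
for name, risk in decisions:
    summary[name] = (max_risk(risk, grid), bayes_risk_uniform(risk, grid))
    print(f"{name}: max risk {summary[name][0]:.4f}, Bayes risk {summary[name][1]:.4f}")
```

For each decision the maximal risk dominates the averaged risk, which is the inequality suggested in the hint to Exercise 1.5.1. The constant decision has zero risk at $\theta_0$ but a much larger maximal risk than the empirical mean.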


Chapter 2 
Parameter Estimation for an i.i.d. Model


This chapter is very important for understanding the whole book. It starts with very
classical stuff: Glivenko–Cantelli results for the empirical measure that motivate
the famous substitution principle. Then the method of moments is studied in
more detail including the risk analysis and asymptotic properties. Some other
classical estimation procedures are briefly discussed including the methods of
minimum distance, M-estimates, and their special cases: least squares, least absolute
deviations, and maximum likelihood estimates (MLEs). The concept of efficiency
is discussed in the context of the Cramér–Rao risk bound, which is given in the univariate
and multivariate cases. The last sections of Chap. 2 start a kind of smooth transition
from classical to “modern” parametric statistics and they reveal the approach of the 
book. The presentation is focused on the (quasi) likelihood-based concentration and 
confidence sets. The basic concentration result is first introduced for the simplest 
Gaussian shift model and then extended to the case of a univariate exponential 
family in Sect. 2.11. 

Below in this chapter we consider the estimation problem for a sample of
independent identically distributed (i.i.d.) observations. Throughout the chapter the
data $Y$ are assumed to be given in the form of a sample $(Y_1, \ldots, Y_n)^\top$. We assume
that the observations $Y_1, \ldots, Y_n$ are i.i.d.; each $Y_i$ is from an unknown distribution
$P$, also called a marginal measure. The joint data distribution $\mathbb{P}$ is the $n$-fold product
of $P$: $\mathbb{P} = P^{\otimes n}$. Thus, the measure $\mathbb{P}$ is uniquely identified by $P$ and the statistical
problem can be reduced to recovering $P$.

The further step in model specification is based on a parametric assumption (PA):
the measure $P$ belongs to a given parametric family.


2.1 Empirical Distribution: Glivenko–Cantelli Theorem


Let $Y = (Y_1, \ldots, Y_n)^\top$ be an i.i.d. sample. For simplicity we assume that the $Y_i$'s
are univariate with values in $\mathbb{R}$. Let $P$ denote the distribution of each $Y_i$:

$$P(B) = \mathbb{P}(Y_i \in B), \qquad B \in \mathcal{B}(\mathbb{R}).$$

One often says that $Y$ is an i.i.d. sample from $P$. Let also $F$ be the corresponding
distribution function (cdf):

$$F(y) = \mathbb{P}(Y_1 \le y) = P((-\infty, y]).$$

The assumption that the $Y_i$'s are i.i.d. implies that the joint distribution $\mathbb{P}$ of the data
$Y$ is given by the $n$-fold product of the marginal measure $P$:

$$\mathbb{P} = P^{\otimes n}.$$


Let also $P_n$ (resp. $F_n$) be the empirical measure (resp. empirical distribution
function (edf))

$$P_n(B) = \frac{1}{n} \sum \mathbf{1}(Y_i \in B), \qquad F_n(y) = \frac{1}{n} \sum \mathbf{1}(Y_i \le y).$$

Here and everywhere in this chapter the symbol $\sum$ stands for $\sum_{i=1}^{n}$. One can
consider $F_n$ as the distribution function of the empirical measure $P_n$ defined as the
atomic measure at the $Y_i$'s:

$$P_n(A) \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(Y_i \in A).$$


So, $P_n(A)$ is the empirical frequency of the event $A$, that is, the fraction of
observations $Y_i$ belonging to $A$. By the law of large numbers one can expect that
this empirical frequency is close to the true probability $P(A)$ if the number of
observations is sufficiently large.
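
Both objects are simple counting functions of the sample. A minimal sketch, in which the set $A$ is represented by an indicator function and the sample values are made up for illustration:

```python
def empirical_measure(sample, indicator):
    """P_n(A): fraction of observations Y_i for which indicator(Y_i) is true."""
    return sum(1 for y in sample if indicator(y)) / len(sample)


def edf(sample, y):
    """F_n(y) = P_n((-inf, y])."""
    return empirical_measure(sample, lambda v: v <= y)


sample = [0.2, 1.5, -0.7, 0.2, 3.1]   # made-up observations
print(edf(sample, 0.2))                                 # 3 of 5 observations are <= 0.2
print(empirical_measure(sample, lambda v: 0 < v < 2))   # 3 of 5 observations lie in (0, 2)
```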

An equivalent definition of the empirical measure and edf can be given in terms
of the empirical mean $\mathbb{E}_n g$ for a measurable function $g$:

$$\mathbb{E}_n g \stackrel{\mathrm{def}}{=} \int_{-\infty}^{\infty} g(y)\, P_n(dy) = \int_{-\infty}^{\infty} g(y)\, dF_n(y) = \frac{1}{n} \sum_{i=1}^{n} g(Y_i).$$

The first result claims that indeed, for every Borel set $B$ on the real line, the
empirical mass $P_n(B)$ (which is random) is close in probability to the population
counterpart $P(B)$.


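Theorem 2.1.1. For any Borel set $B$, it holds

1. $\mathbb{E} P_n(B) = P(B)$.

2. $\operatorname{Var}\{P_n(B)\} = n^{-1} \sigma_B^2$ with $\sigma_B^2 = P(B)\{1 - P(B)\}$.

3. $P_n(B) \to P(B)$ in probability as $n \to \infty$.

4. $\sqrt{n}\,\{P_n(B) - P(B)\} \xrightarrow{w} N(0, \sigma_B^2)$.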
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Proof. Denote & = 1(Y; € B). This is a Bernoullir.v. with parameter P(B) = KEé;. 
The first statement holds by definition of P,(B) = n7! &. Next, foreachi <n, 


def 


Var & = Hs? — (E&)’ = P(B){1 — P(B)} 


in view of &? = &. Independence of the &;’s yields 
Var{ P,(B)} = Var( 1! >» i) =n” S- Var éj =n 'o% 
i=l i=l 
The third statement follows by the law of large numbers for the i.i.d. r.v. &;: 
i P 
= ». §; —_ TEE, . 
n 
i=l 
Finally, the last statement follows by the Central Limit Theorem for the &;: 
(6 — Bé;) —> N(0, 03). 
i i=1 
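The first two statements of Theorem 2.1.1 can be checked numerically. The following is a minimal Python sketch rather than part of the original text: the choice of Uniform(0, 1) data, the interval $B = (0.2, 0.5]$, and the helper names `empirical_measure` and `simulate` are assumptions made purely for illustration.

```python
import random

def empirical_measure(sample, a, b):
    """P_n(B) for the interval B = (a, b]: fraction of observations in B."""
    return sum(1 for y in sample if a < y <= b) / len(sample)

def simulate(n, reps, a, b, seed=0):
    """Draw `reps` samples of size n from Uniform(0, 1); return the P_n(B) values."""
    rng = random.Random(seed)
    return [empirical_measure([rng.random() for _ in range(n)], a, b)
            for _ in range(reps)]

n, reps, a, b = 200, 2000, 0.2, 0.5
p_true = b - a                          # P(B) under Uniform(0, 1)
values = simulate(n, reps, a, b)
mean = sum(values) / reps
var = sum((v - mean) ** 2 for v in values) / reps
print(mean, p_true)                     # statement 1: mean of P_n(B) vs. P(B)
print(n * var, p_true * (1 - p_true))   # statement 2: n * Var vs. sigma_B^2
```

Across repetitions the average of $P_n(B)$ should be close to $P(B)$ and $n$ times its variance close to $\sigma_B^2$, up to Monte Carlo error.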


The next important result shows that the edf $F_n$ is a good approximation of the
cdf $F$ in the uniform norm.

Theorem 2.1.2 (Glivenko-Cantelli). It holds

$$\sup_y |F_n(y) - F(y)| \to 0, \qquad n \to \infty.$$


Proof. Consider first the case when the function $F$ is continuous in $y$. Fix any
integer $N$ and define with $\varepsilon = 1/N$ the points $t_1 < t_2 < \ldots < t_N = +\infty$ such that
$F(t_j) - F(t_{j-1}) = \varepsilon$ for $j = 2, \ldots, N$. For every $j$, by (3) of Theorem 2.1.1, it
holds $F_n(t_j) \to F(t_j)$. This implies that for some $n(N)$, it holds for all $n \ge n(N)$
with a probability at least $1 - \varepsilon$

$$|F_n(t_j) - F(t_j)| \le \varepsilon, \qquad j = 1, \ldots, N. \qquad (2.1)$$

Now for every $t \in [t_{j-1}, t_j]$, it holds by definition

$$F(t_{j-1}) \le F(t) \le F(t_j), \qquad F_n(t_{j-1}) \le F_n(t) \le F_n(t_j).$$

This together with (2.1) implies

$$\mathbb{P}\bigl(|F_n(t) - F(t)| > 2\varepsilon\bigr) \le \varepsilon.$$

Since (2.1) holds for all $j$ simultaneously on the same event, this bound is uniform in $t$.

If the function $F(\cdot)$ is not continuous, then for every positive $\varepsilon$, there exists a finite
set $S_\varepsilon$ of points of discontinuity $s_m$ with $F(s_m) - F(s_m - 0) \ge \varepsilon$. One can proceed
as in the continuous case by adding the points from $S_\varepsilon$ to the discrete set $\{t_j\}$.
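To see the uniform approximation at work, one can compute the distance $\sup_y |F_n(y) - F(y)|$ for simulated data. The sketch below is illustrative only and assumes Uniform(0, 1) observations, so that $F(y) = y$ on $[0, 1]$. The function name `sup_distance_uniform` is hypothetical.

```python
import random

def sup_distance_uniform(sample):
    """sup_y |F_n(y) - F(y)| for F(y) = y on [0, 1] (the Uniform(0, 1) cdf)."""
    ys = sorted(sample)
    n = len(ys)
    # F_n is a step function jumping by 1/n at each order statistic, so the
    # supremum is attained at a jump point or just to the left of one.
    return max(max((i + 1) / n - y, y - i / n) for i, y in enumerate(ys))

rng = random.Random(1)
distances = {n: sup_distance_uniform([rng.random() for _ in range(n)])
             for n in (100, 10_000)}
print(distances)   # the distance for the larger sample is typically much smaller
```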


Exercise 2.1.1. Check the details of the proof of Theorem 2.1.2. 


The results of Theorems 2.1.1 and 2.1.2 can be extended to certain functionals of
the distribution $P$. Let $g(y)$ be a function on the real line. Consider its expectation

$$s_0 \stackrel{\text{def}}{=} \mathbb{E} g(Y_1) = \int_{-\infty}^{\infty} g(y)\, dF(y).$$

Its empirical counterpart is defined by

$$S_n \stackrel{\text{def}}{=} \int_{-\infty}^{\infty} g(y)\, dF_n(y) = \frac{1}{n}\sum_{i=1}^n g(Y_i).$$

It turns out that $S_n$ indeed estimates $s_0$ well, at least for large $n$.


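Theorem 2.1.3. Let $g(y)$ be a function on the real line such that

$$\int_{-\infty}^{\infty} g^2(y)\, dF(y) < \infty.$$

Then

$$S_n \xrightarrow{P} s_0, \qquad \sqrt{n}\,(S_n - s_0) \xrightarrow{w} N(0, \sigma_g^2), \qquad n \to \infty,$$

where

$$\sigma_g^2 \stackrel{\text{def}}{=} \int g^2(y)\, dF(y) - s_0^2 = \int [g(y) - s_0]^2\, dF(y).$$

Moreover, if $h(z)$ is a twice continuously differentiable function on the real line, and
$h'(s_0) \ne 0$, then

$$h(S_n) \xrightarrow{P} h(s_0), \qquad \sqrt{n}\,\{h(S_n) - h(s_0)\} \xrightarrow{w} N(0, \sigma_h^2), \qquad n \to \infty,$$

where $\sigma_h^2 \stackrel{\text{def}}{=} |h'(s_0)|^2 \sigma_g^2$.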


Proof. The first statement is again the CLT for the i.i.d. random variables $\xi_i = g(Y_i)$
having mean value $s_0$ and variance $\sigma_g^2$.
It also implies the second statement in view of the Taylor expansion

$$h(S_n) - h(s_0) \approx h'(s_0)\,(S_n - s_0).$$

Exercise 2.1.2. Complete the proof.

Hint: use the first result to show that $S_n$ belongs with high probability to a small
neighborhood $U$ of the point $s_0$.

Then apply the Taylor expansion of second order to $h(S_n) - h(s_0) = h(s_0 + n^{-1/2}\xi_n) - h(s_0)$ with $\xi_n = \sqrt{n}\,(S_n - s_0)$:

$$\bigl| n^{1/2}[h(S_n) - h(s_0)] - h'(s_0)\,\xi_n \bigr| \le n^{-1/2} H^* \xi_n^2 / 2,$$

where $H^* = \max_{y \in U} |h''(y)|$. Show that $n^{-1/2}\xi_n^2 \xrightarrow{P} 0$ because $\xi_n$ is stochastically
bounded by the first statement of the theorem.
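The second statement of Theorem 2.1.3 can be illustrated by simulation. The sketch below is an assumption-laden example, not taken from the text: it uses Exp(1) data with $g(y) = y$, so $s_0 = 1$ and $\sigma_g^2 = 1$, and the function $h(z) = z^2$, for which the theorem gives the limiting variance $|h'(s_0)|^2 \sigma_g^2 = 4$. The helper names are hypothetical.

```python
import math
import random

def scaled_errors(n, reps, h, s0, seed=0):
    """Return `reps` draws of sqrt(n) * (h(S_n) - h(s0)) for Exp(1) data, g(y) = y."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        s_n = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (h(s_n) - h(s0)))
    return out

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

# Exp(1): s0 = E Y = 1 and sigma_g^2 = Var Y = 1; h(z) = z^2 gives h'(s0) = 2.
errors = scaled_errors(n=400, reps=2000, h=lambda z: z * z, s0=1.0)
sigma_h2 = (2.0 ** 2) * 1.0           # |h'(s0)|^2 * sigma_g^2
print(variance(errors), sigma_h2)     # empirical variance vs. the limiting value
```

The two printed numbers agree only approximately: the remainder term of the Taylor expansion and the Monte Carlo error both contribute at finite $n$.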


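The results of Theorems 2.1.2 and 2.1.3 can be extended to the case of a vectorial
function $g(\cdot): \mathbb{R}^1 \to \mathbb{R}^m$, that is, $g(y) = (g_1(y), \ldots, g_m(y))^\top$ for $y \in \mathbb{R}^1$. Then
$s_0 = (s_{0,1}, \ldots, s_{0,m})^\top$ and its empirical counterpart $S_n = (S_{n,1}, \ldots, S_{n,m})^\top$ are
vectors in $\mathbb{R}^m$ as well:

$$s_{0,j} \stackrel{\text{def}}{=} \int_{-\infty}^{\infty} g_j(y)\, dF(y), \qquad S_{n,j} \stackrel{\text{def}}{=} \int_{-\infty}^{\infty} g_j(y)\, dF_n(y), \qquad j = 1, \ldots, m.$$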


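Theorem 2.1.4. Let $g(y)$ be an $\mathbb{R}^m$-valued function on the real line with a bounded
covariance matrix $\Sigma = (\Sigma_{jk})_{j,k=1,\ldots,m}$:

$$\Sigma_{jk} \stackrel{\text{def}}{=} \int [g_j(y) - s_{0,j}][g_k(y) - s_{0,k}]\, dF(y) < \infty, \qquad j, k \le m.$$

Then

$$S_n \xrightarrow{P} s_0, \qquad \sqrt{n}\,(S_n - s_0) \xrightarrow{w} N(0, \Sigma), \qquad n \to \infty.$$

Moreover, if $H(z)$ is a twice continuously differentiable function on $\mathbb{R}^m$ and
$\Sigma H'(s_0) \ne 0$ where $H'(z)$ stands for the gradient of $H$ at $z$, then

$$H(S_n) \xrightarrow{P} H(s_0), \qquad \sqrt{n}\,\{H(S_n) - H(s_0)\} \xrightarrow{w} N(0, \sigma_H^2), \qquad n \to \infty,$$

where $\sigma_H^2 \stackrel{\text{def}}{=} H'(s_0)^\top \Sigma\, H'(s_0)$.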


Exercise 2.1.3. Prove Theorem 2.1.4.
Hint: consider for every $h \in \mathbb{R}^m$ the scalar products $h^\top g(y)$, $h^\top s_0$, $h^\top S_n$.
For the first statement, it suffices to show that

$$h^\top S_n \xrightarrow{P} h^\top s_0, \qquad \sqrt{n}\, h^\top (S_n - s_0) \xrightarrow{w} N(0, h^\top \Sigma h), \qquad n \to \infty.$$

For the second statement, consider the expansion

$$\bigl| n^{1/2}[H(S_n) - H(s_0)] - \xi_n^\top H'(s_0) \bigr| \le n^{-1/2} H^* \|\xi_n\|^2 / 2 \xrightarrow{P} 0,$$

with $\xi_n = n^{1/2}(S_n - s_0)$ and $H^* = \max_{y \in U} \|H''(y)\|$ for a neighborhood $U$ of
$s_0$. Here $\|A\|$ means the maximal eigenvalue of a symmetric matrix $A$.


2.2 Substitution Principle: Method of Moments 


By the Glivenko-Cantelli theorem the empirical measure $P_n$ (resp. edf $F_n$) is a good
approximation of the true measure $P$ (resp. cdf $F$), at least if $n$ is sufficiently large.
This leads to the important substitution method of statistical estimation: represent
the target of estimation as a function of the distribution $P$, then replace $P$ by $P_n$.

Suppose that there exists some functional $g$ of a measure $P_\theta$ from the family
$\mathcal{P} = (P_\theta, \theta \in \Theta)$ such that the following identity holds:

$$\theta \equiv g(P_\theta), \qquad \theta \in \Theta.$$

This particularly implies $\theta^* = g(P_{\theta^*}) = g(P)$. The substitution estimate is
defined by substituting $P_n$ for $P$:

$$\tilde{\theta} = g(P_n).$$

Sometimes the obtained value $\tilde{\theta}$ can lie outside the parameter set $\Theta$. Then one can
redefine the estimate $\tilde{\theta}$ as the value providing the best fit of $g(P_n)$:

$$\tilde{\theta} = \operatorname*{argmin}_{\theta \in \Theta} \|g(P_\theta) - g(P_n)\|.$$

Here $\|\cdot\|$ denotes some norm on the parameter set $\Theta$, e.g. the Euclidean norm.


2.2.1 Method of Moments: Univariate Parameter 


The method of moments is a special but at the same time the most frequently used
case of the substitution method. For illustration, we start with the univariate case.
Let $\Theta \subseteq \mathbb{R}$, that is, $\theta$ is a univariate parameter. Let $g(y)$ be a function on $\mathbb{R}$ such
that the first moment

$$m(\theta) \stackrel{\text{def}}{=} \mathbb{E}_\theta g(Y_1) = \int g(y)\, dP_\theta(y)$$

is continuous and strictly monotonic. Then the parameter $\theta$ can be uniquely identified by
the value $m(\theta)$, that is, there exists an inverse function $m^{-1}$ satisfying

$$\theta = m^{-1}\Bigl(\int g(y)\, dP_\theta(y)\Bigr).$$

The substitution method leads to the estimate

$$\tilde{\theta} = m^{-1}\Bigl(\int g(y)\, dP_n(y)\Bigr) = m^{-1}\Bigl(\frac{1}{n}\sum g(Y_i)\Bigr).$$

Usually $g(x) = x$ or $g(x) = x^2$, which explains the name of the method. This
method was proposed by Pearson and is historically the first regular method of
constructing a statistical estimate.
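When $m^{-1}$ has no closed form, the moment equation $m(\theta) = n^{-1}\sum g(Y_i)$ can be solved numerically. The following Python sketch uses plain bisection. The function name `moment_estimate`, the bracketing interval, and the example (an exponential law with mean $\theta$ and $g(y) = y^2$, so that $m(\theta) = 2\theta^2$) are assumptions chosen for illustration, not constructions from the text.

```python
def moment_estimate(sample, g, m, lo, hi, tol=1e-10):
    """Solve m(theta) = (1/n) * sum g(Y_i) for theta by bisection.

    Assumes m is continuous and strictly increasing on [lo, hi] and that
    the empirical moment lies in [m(lo), m(hi)].
    """
    target = sum(g(y) for y in sample) / len(sample)
    if not m(lo) <= target <= m(hi):
        raise ValueError("empirical moment outside the range of m on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if m(mid) < target:
            lo = mid      # the solution lies to the right of mid
        else:
            hi = mid      # the solution lies to the left of (or at) mid
    return 0.5 * (lo + hi)

sample = [1.0, 2.0, 3.0]
theta = moment_estimate(sample, g=lambda y: y * y,
                        m=lambda t: 2.0 * t * t, lo=0.0, hi=10.0)
print(theta)   # solves 2 * theta^2 = mean of Y_i^2
```

In this example the inverse is also available explicitly, $\tilde{\theta} = \{(2n)^{-1}\sum Y_i^2\}^{1/2}$, which makes it easy to check the numerical answer.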


2.2.2 Method of Moments: Multivariate Parameter 


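The method of moments can be easily extended to the multivariate case. Let $\Theta \subseteq \mathbb{R}^p$,
and let $g(y) = (g_1(y), \ldots, g_p(y))^\top$ be a function with values in $\mathbb{R}^p$. Define
the moments $m(\theta) = (m_1(\theta), \ldots, m_p(\theta))^\top$ by

$$m_j(\theta) = \mathbb{E}_\theta g_j(Y_1) = \int g_j(y)\, dP_\theta(y).$$

The main requirement on the choice of the vector function $g$ is that the function $m$
is invertible, that is, the system of equations

$$m_j(\theta) = t_j$$

has a unique solution for any $t \in \mathbb{R}^p$. The empirical counterpart $M_n$ of the true
moments $m(\theta^*)$ is given by

$$M_n \stackrel{\text{def}}{=} \int g(y)\, dP_n(y) = \Bigl(\frac{1}{n}\sum g_1(Y_i), \ldots, \frac{1}{n}\sum g_p(Y_i)\Bigr)^\top.$$

Then the estimate $\tilde{\theta}$ can be defined as

$$\tilde{\theta} \stackrel{\text{def}}{=} m^{-1}(M_n) = m^{-1}\Bigl(\frac{1}{n}\sum g_1(Y_i), \ldots, \frac{1}{n}\sum g_p(Y_i)\Bigr).$$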


2.2.3 Method of Moments: Examples 


This section lists some widely used parametric families and discusses the problem
of constructing the parameter estimates by different methods. In all the examples
we assume that an i.i.d. sample from a distribution $P$ is observed, and this measure
$P$ belongs to a given parametric family $(P_\theta, \theta \in \Theta)$, that is, $P = P_{\theta^*}$ for $\theta^* \in \Theta$.


2.2.3.1 Gaussian Shift


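Let $P_\theta$ be the normal distribution on the real line with mean $\theta$ and the known
variance $\sigma^2$. The corresponding density w.r.t. the Lebesgue measure reads as

$$p(y, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl\{-\frac{(y - \theta)^2}{2\sigma^2}\Bigr\}.$$

It holds $\mathbb{E}_\theta Y_1 = \theta$ and $\operatorname{Var}_\theta(Y_1) = \sigma^2$, leading to the moment estimate

$$\tilde{\theta} = \int y\, dP_n(y) = \frac{1}{n}\sum Y_i$$

with mean $\mathbb{E}_\theta \tilde{\theta} = \theta$ and variance

$$\operatorname{Var}_\theta(\tilde{\theta}) = \sigma^2/n.$$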


2.2.3.2 Univariate Normal Distribution 
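Let $Y_i \sim N(\alpha, \sigma^2)$ as in the previous example but both the mean $\alpha$ and the variance
$\sigma^2$ are unknown. This leads to the problem of estimating the vector $\theta = (\theta_1, \theta_2) = (\alpha, \sigma^2)$
from the i.i.d. sample $Y$.

The method of moments suggests estimating the parameters from the first two
empirical moments of the $Y_i$'s using the equations $m_1(\theta) = \mathbb{E}_\theta Y_1 = \alpha$, $m_2(\theta) = \mathbb{E}_\theta Y_1^2 = \alpha^2 + \sigma^2$.
Inverting these equalities leads to

$$\alpha = m_1(\theta), \qquad \sigma^2 = m_2(\theta) - m_1^2(\theta).$$

Substituting the empirical measure $P_n$ yields the expressions for $\tilde{\theta}$:

$$\tilde{\alpha} = \frac{1}{n}\sum Y_i, \qquad \tilde{\sigma}^2 = \frac{1}{n}\sum Y_i^2 - \Bigl(\frac{1}{n}\sum Y_i\Bigr)^2 = \frac{1}{n}\sum (Y_i - \tilde{\alpha})^2. \qquad (2.2)$$

As previously for the case of a known variance, it holds under $P = P_\theta$:

$$\mathbb{E}\tilde{\alpha} = \alpha, \qquad \operatorname{Var}_\theta(\tilde{\alpha}) = \sigma^2/n.$$

However, for the estimate $\tilde{\sigma}^2$ of $\sigma^2$, the result is slightly different and it is described
in the next theorem.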


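Theorem 2.2.1. It holds

$$\mathbb{E}_\theta \tilde{\sigma}^2 = \frac{n-1}{n}\,\sigma^2, \qquad \operatorname{Var}_\theta(\tilde{\sigma}^2) = \frac{2(n-1)}{n^2}\,\sigma^4.$$

Proof. We use vector notation. Consider the unit vector $e = n^{-1/2}(1, \ldots, 1)^\top \in \mathbb{R}^n$
and denote by $\Pi_1$ the projector on $e$:

$$\Pi_1 h = (e^\top h)\, e.$$

Then by definition $\tilde{\alpha} = n^{-1/2} e^\top \Pi_1 Y$ and $\tilde{\sigma}^2 = n^{-1}\|Y - \Pi_1 Y\|^2$. Moreover, the
model equation $Y = n^{1/2}\alpha e + \varepsilon$ implies in view of $\Pi_1 e = e$ that

$$\Pi_1 Y = n^{1/2}\alpha e + \Pi_1 \varepsilon.$$

Now

$$n\tilde{\sigma}^2 = \|Y - \Pi_1 Y\|^2 = \|\varepsilon - \Pi_1 \varepsilon\|^2 = \|(I_n - \Pi_1)\varepsilon\|^2,$$

where $I_n$ is the identity operator in $\mathbb{R}^n$ and $I_n - \Pi_1$ is the projector on the hyperplane
in $\mathbb{R}^n$ orthogonal to the vector $e$. Obviously $(I_n - \Pi_1)\varepsilon$ is a Gaussian vector with
zero mean and the covariance matrix $V$ defined by

$$V = \mathbb{E}\bigl[(I_n - \Pi_1)\varepsilon\varepsilon^\top (I_n - \Pi_1)\bigr] = (I_n - \Pi_1)\,\mathbb{E}(\varepsilon\varepsilon^\top)\,(I_n - \Pi_1) = \sigma^2 (I_n - \Pi_1)^2 = \sigma^2 (I_n - \Pi_1).$$

It remains to note that for any Gaussian vector $\xi \sim N(0, V)$ it holds

$$\mathbb{E}\|\xi\|^2 = \operatorname{tr} V, \qquad \operatorname{Var}(\|\xi\|^2) = 2\operatorname{tr}(V^2),$$

and that here $\operatorname{tr} V = (n-1)\sigma^2$ and $\operatorname{tr}(V^2) = (n-1)\sigma^4$.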


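Exercise 2.2.1. Check the details of the proof.
Hint: reduce to the case of diagonal $V$.


Exercise 2.2.2. Compute the covariance $\mathbb{E}(\tilde{\alpha} - \alpha)(\tilde{\sigma}^2 - \sigma^2)$. Show that $\tilde{\alpha}$ and $\tilde{\sigma}^2$
are independent.

Hint: represent $\tilde{\alpha} - \alpha = n^{-1/2} e^\top \Pi_1 \varepsilon$ and $\tilde{\sigma}^2 = n^{-1}\|(I_n - \Pi_1)\varepsilon\|^2$. Use that
$\Pi_1 \varepsilon$ and $(I_n - \Pi_1)\varepsilon$ are independent if $\Pi_1$ is a projector and $\varepsilon$ is a Gaussian vector.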
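The moment estimates (2.2) and the downward bias of $\tilde{\sigma}^2$ can be checked by a small simulation. The sketch below is illustrative only: the function names, the sample size $n = 5$, and the parameter values are assumptions. The comparison values are $(n-1)\sigma^2/n$ for the mean and $2(n-1)\sigma^4/n^2$ for the variance, the latter being the standard value obtained from the trace identities in the proof.

```python
import random

def normal_moment_estimates(sample):
    """Method-of-moments estimates (alpha, sigma^2) from the first two empirical moments."""
    n = len(sample)
    m1 = sum(sample) / n
    m2 = sum(y * y for y in sample) / n
    return m1, m2 - m1 * m1

def mean_and_variance(xs):
    mean = sum(xs) / len(xs)
    return mean, sum((x - mean) ** 2 for x in xs) / len(xs)

n, sigma, reps = 5, 2.0, 20000
rng = random.Random(0)
sigma2_tilde = [
    normal_moment_estimates([rng.gauss(1.0, sigma) for _ in range(n)])[1]
    for _ in range(reps)
]
emp_mean, emp_var = mean_and_variance(sigma2_tilde)
print(emp_mean, (n - 1) / n * sigma ** 2)            # mean of the variance estimate
print(emp_var, 2 * (n - 1) / n ** 2 * sigma ** 4)    # its variance (chi-square scaling)
```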


2.2.3.3 Uniform Distribution on $[0, \theta]$
Let $Y_i$ be uniformly distributed on the interval $[0, \theta]$ of the real line where the right
end point $\theta$ is unknown. The density $p(y, \theta)$ of $P_\theta$ w.r.t. the Lebesgue measure is
$\theta^{-1}\mathbf{1}(0 \le y \le \theta)$. It is easy to compute that for an integer $k$

$$\mathbb{E}_\theta(Y_1^k) = \theta^{-1}\int_0^\theta y^k\, dy = \theta^k/(k+1),$$

or $\theta = \{(k+1)\,\mathbb{E}_\theta(Y_1^k)\}^{1/k}$. This leads to the family of estimates

$$\tilde{\theta}_k = \Bigl(\frac{k+1}{n}\sum Y_i^k\Bigr)^{1/k}.$$

Letting $k$ tend to infinity leads to the estimate

$$\tilde{\theta}_\infty = \max\{Y_1, \ldots, Y_n\}.$$

This estimate is quite natural in the context of the uniform distribution. Later it
will appear once again as the MLE. However, it is not a moment estimate.
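The convergence of the moment estimates $\{(k+1) n^{-1}\sum Y_i^k\}^{1/k}$ to the sample maximum as $k$ grows can be observed directly. The following sketch is an illustration under assumptions: the function name `theta_k` and the three-point sample are hypothetical, and the maximum is factored out of the sum only to avoid floating-point underflow for large $k$.

```python
def theta_k(sample, k):
    """Moment estimate ((k + 1)/n * sum Y_i^k)^(1/k) for the Uniform[0, theta] family.

    The maximum is factored out so that large k does not underflow;
    assumes a sample of positive numbers.
    """
    n = len(sample)
    y_max = max(sample)
    scaled = sum((y / y_max) ** k for y in sample)   # each term lies in [0, 1]
    return y_max * ((k + 1) / n * scaled) ** (1.0 / k)

sample = [0.2, 0.9, 0.5]
for k in (1, 2, 10, 1000, 10 ** 6):
    print(k, theta_k(sample, k))
print(max(sample))   # the limit of theta_k as k grows
```

For $k = 1$ this is twice the sample mean. The approach to the maximum is slow because of the factor $\{(k+1)/n\}^{1/k}$, which tends to one only at a logarithmic rate in the exponent.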


2.2.3.4 Bernoulli or Binomial Model 
Let $P_\theta$ be a Bernoulli law for $\theta \in [0, 1]$. Then every $Y_i$ is binary with

$$\mathbb{E}_\theta Y_i = \theta.$$

This leads to the moment estimate

$$\tilde{\theta} = \int y\, dP_n(y) = \frac{1}{n}\sum Y_i.$$

Exercise 2.2.3. Compute the moment estimate for $g(y) = y^k$, $k \ge 1$.


2.2.3.5 Multinomial Model 
The multinomial distribution $B_\theta^m$ describes the number of successes in $m$ experiments
when each success has the probability $\theta \in [0, 1]$. This distribution can be
viewed as the sum of $m$ Bernoulli random variables with the same parameter $\theta$. Observed is the
sample $Y$ where each $Y_i$ is the number of successes in the $i$th experiment. One has

$$P_\theta(Y_1 = k) = \binom{m}{k}\theta^k (1 - \theta)^{m-k}, \qquad k = 0, \ldots, m.$$

Exercise 2.2.4. Check that the method of moments with $g(x) = x$ leads to the estimate

$$\tilde{\theta} = \frac{1}{mn}\sum Y_i.$$

Compute $\operatorname{Var}_\theta(\tilde{\theta})$.
Hint: Reduce the multinomial model to the sum of $m$ Bernoulli.


2.2.3.6 Exponential Model


Let $P_\theta$ be an exponential distribution on the positive semiaxis with the parameter $\theta$.
This means

$$P_\theta(Y_1 > y) = e^{-y/\theta}.$$

Exercise 2.2.5. Check that the method of moments with $g(x) = x$ leads to the estimate

$$\tilde{\theta} = \frac{1}{n}\sum Y_i.$$

Compute $\operatorname{Var}_\theta(\tilde{\theta})$.


2.2.3.7 Poisson Model 


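Let $P_\theta$ be the Poisson distribution with the parameter $\theta$. The Poisson random
variable $Y_1$ is integer-valued with

$$P_\theta(Y_1 = k) = \frac{\theta^k}{k!}\, e^{-\theta}.$$

Exercise 2.2.6. Check that the method of moments with $g(x) = x$ leads to the estimate

$$\tilde{\theta} = \frac{1}{n}\sum Y_i.$$

Compute $\operatorname{Var}_\theta(\tilde{\theta})$.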


2.2.3.8 Shift of a Laplace (Double Exponential) Law 
Let $P_0$ be a symmetric distribution defined by the equations

$$P_0(|Y_1| > y) = e^{-y/\sigma}, \qquad y \ge 0,$$

for some given $\sigma > 0$. Equivalently one can say that the absolute value of $Y_1$ is
exponential with parameter $\sigma$ under $P_0$. Now define $P_\theta$ by shifting $P_0$ by the
value $\theta$. This means that

$$P_\theta(|Y_1 - \theta| > y) = e^{-y/\sigma}, \qquad y \ge 0.$$

It is obvious that $\mathbb{E}_0 Y_1 = 0$ and $\mathbb{E}_\theta Y_1 = \theta$.

Exercise 2.2.7. Check that the method of moments leads to the estimate

$$\tilde{\theta} = \frac{1}{n}\sum Y_i.$$

Compute $\operatorname{Var}_\theta(\tilde{\theta})$.


2.2.3.9 Shift of a Symmetric Density 
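Let the observations $Y_i$ be defined by the equation

$$Y_i = \theta^* + \varepsilon_i$$

where $\theta^*$ is an unknown parameter and the errors $\varepsilon_i$ are i.i.d. with a density
symmetric around zero and finite second moment $\sigma^2 = \mathbb{E}\varepsilon_1^2$. This particularly
yields that $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{E}Y_i = \theta^*$. The method of moments immediately yields the
empirical mean estimate

$$\tilde{\theta} = \frac{1}{n}\sum Y_i$$

with $\operatorname{Var}_\theta(\tilde{\theta}) = \sigma^2/n$.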


2.3 Unbiased Estimates, Bias, and Quadratic Risk


Consider a parametric i.i.d. experiment corresponding to a sample $Y = (Y_1, \ldots, Y_n)^\top$
from a distribution $P_{\theta^*} \in (P_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$. By $\theta^*$ we denote the true
parameter from $\Theta$. Let $\tilde{\theta}$ be an estimate of $\theta^*$, that is, a function of the available
data $Y$ with values in $\Theta$: $\tilde{\theta} = \tilde{\theta}(Y)$.

An estimate $\tilde{\theta}$ of the parameter $\theta^*$ is called unbiased if

$$\mathbb{E}_{\theta^*}\tilde{\theta} = \theta^*.$$

This property seems to be rather natural and desirable. However, it is often just
a matter of parametrization. Indeed, if $g: \Theta \to \Theta$ is a linear transformation of the
parameter set $\Theta$, that is, $g(\theta) = A\theta + b$, then the estimate $\tilde{\vartheta} \stackrel{\text{def}}{=} A\tilde{\theta} + b$ of the
new parameter $\vartheta = A\theta + b$ is again unbiased. However, if $m(\cdot)$ is a nonlinear
transformation, then the identity $\mathbb{E}_{\theta^*} m(\tilde{\theta}) = m(\theta^*)$ is not preserved.


Example 2.3.1. Consider the Gaussian shift experiment for $Y_i$ i.i.d. $N(\theta^*, \sigma^2)$ with
known variance $\sigma^2$ and unknown shift parameter $\theta^*$. Then $\tilde{\theta} = n^{-1}(Y_1 + \ldots + Y_n)$
is an unbiased estimate of $\theta^*$. However, for $m(\theta) = \theta^2$, it holds

$$\mathbb{E}_{\theta^*}|\tilde{\theta}|^2 = |\theta^*|^2 + \sigma^2/n,$$

that is, the estimate $|\tilde{\theta}|^2$ of $|\theta^*|^2$ is slightly biased.


The property of "no bias" is especially important in connection with the quadratic
risk of the estimate $\tilde{\theta}$. To illustrate this point, we first consider the case of a
univariate parameter.


2.3.1 Univariate Parameter 


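Let $\theta \in \Theta \subseteq \mathbb{R}^1$. Denote by $\operatorname{Var}(\tilde{\theta})$ the variance of the estimate $\tilde{\theta}$:

$$\operatorname{Var}_{\theta^*}(\tilde{\theta}) = \mathbb{E}_{\theta^*}(\tilde{\theta} - \mathbb{E}_{\theta^*}\tilde{\theta})^2.$$

The quadratic risk of $\tilde{\theta}$ is defined by

$$R(\tilde{\theta}, \theta^*) \stackrel{\text{def}}{=} \mathbb{E}_{\theta^*}|\tilde{\theta} - \theta^*|^2.$$

It is obvious that $R(\tilde{\theta}, \theta^*) = \operatorname{Var}_{\theta^*}(\tilde{\theta})$ if $\tilde{\theta}$ is unbiased. It turns out that the quadratic
risk of $\tilde{\theta}$ is larger than the variance when this property is not fulfilled. Define the
bias of $\tilde{\theta}$ as

$$b(\tilde{\theta}, \theta^*) \stackrel{\text{def}}{=} \mathbb{E}_{\theta^*}\tilde{\theta} - \theta^*.$$

Theorem 2.3.1. It holds for any estimate $\tilde{\theta}$ of the univariate parameter $\theta^*$:

$$R(\tilde{\theta}, \theta^*) = \operatorname{Var}_{\theta^*}(\tilde{\theta}) + b^2(\tilde{\theta}, \theta^*).$$

Due to this result, the bias $b(\tilde{\theta}, \theta^*)$ contributes the value $b^2(\tilde{\theta}, \theta^*)$ to the quadratic
risk. This particularly explains why one is interested in considering unbiased or at
least nearly unbiased estimates.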
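Theorem 2.3.1 is stated here without proof. A short derivation, given as a sketch that uses only the definitions of the variance, the risk, and the bias above, writes $b$ for $b(\tilde{\theta}, \theta^*)$ and splits the error into a centered random part and the deterministic bias:

```latex
\begin{align*}
\tilde{\theta} - \theta^*
  &= (\tilde{\theta} - \mathbb{E}_{\theta^*}\tilde{\theta}) + b, \\
\mathbb{E}_{\theta^*}|\tilde{\theta} - \theta^*|^2
  &= \mathbb{E}_{\theta^*}(\tilde{\theta} - \mathbb{E}_{\theta^*}\tilde{\theta})^2
     + 2b\,\mathbb{E}_{\theta^*}(\tilde{\theta} - \mathbb{E}_{\theta^*}\tilde{\theta})
     + b^2 \\
  &= \operatorname{Var}_{\theta^*}(\tilde{\theta}) + b^2 .
\end{align*}
```

The cross term vanishes because $\tilde{\theta} - \mathbb{E}_{\theta^*}\tilde{\theta}$ has zero mean and $b$ is not random.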


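2.3.2 Multivariate Case


Now we extend the result to the multivariate case with $\theta \in \Theta \subseteq \mathbb{R}^p$. Then $\tilde{\theta}$ is a
vector in $\mathbb{R}^p$. The corresponding variance-covariance matrix $\operatorname{Var}_{\theta^*}(\tilde{\theta})$ is defined as

$$\operatorname{Var}_{\theta^*}(\tilde{\theta}) \stackrel{\text{def}}{=} \mathbb{E}_{\theta^*}\bigl[(\tilde{\theta} - \mathbb{E}_{\theta^*}\tilde{\theta})(\tilde{\theta} - \mathbb{E}_{\theta^*}\tilde{\theta})^\top\bigr].$$

As previously, $\tilde{\theta}$ is unbiased if $\mathbb{E}_{\theta^*}\tilde{\theta} = \theta^*$, and the bias of $\tilde{\theta}$ is
$b(\tilde{\theta}, \theta^*) \stackrel{\text{def}}{=} \mathbb{E}_{\theta^*}\tilde{\theta} - \theta^*$.

The quadratic risk of the estimate $\tilde{\theta}$ in the multivariate case is usually defined via
the Euclidean norm of the difference $\tilde{\theta} - \theta^*$:

$$R(\tilde{\theta}, \theta^*) \stackrel{\text{def}}{=} \mathbb{E}_{\theta^*}\|\tilde{\theta} - \theta^*\|^2.$$

Theorem 2.3.2. It holds

$$R(\tilde{\theta}, \theta^*) = \operatorname{tr}\bigl[\operatorname{Var}_{\theta^*}(\tilde{\theta})\bigr] + \|b(\tilde{\theta}, \theta^*)\|^2.$$

Proof. The result follows similarly to the univariate case using the identity $\|v\|^2 = \operatorname{tr}(vv^\top)$
for any vector $v \in \mathbb{R}^p$.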


Exercise 2.3.1. Complete the proof of Theorem 2.3.2. 


2.4 Asymptotic Properties 


The properties of the previously introduced estimate $\tilde{\theta}$ heavily depend on the sample
size $n$. We therefore use the notation $\tilde{\theta}_n$ to highlight this dependence. A natural
extension of the condition that $\tilde{\theta}$ is unbiased is the requirement that the bias $b(\tilde{\theta}, \theta^*)$
becomes negligible as the sample size $n$ increases. This leads to the notion of
consistency.


Definition 2.4.1. A sequence of estimates $\tilde{\theta}_n$ is consistent if

$$\tilde{\theta}_n \xrightarrow{P} \theta^*, \qquad n \to \infty.$$

$\tilde{\theta}_n$ is mean consistent if

$$\mathbb{E}_{\theta^*}\|\tilde{\theta}_n - \theta^*\| \to 0, \qquad n \to \infty.$$

Clearly mean consistency implies consistency and also asymptotic unbiasedness:

$$b(\tilde{\theta}_n, \theta^*) = \mathbb{E}\tilde{\theta}_n - \theta^* \to 0, \qquad n \to \infty.$$

The property of consistency means that the difference $\tilde{\theta}_n - \theta^*$ is small for $n$ large. The
next natural question to address is how fast this difference tends to zero with $n$. The
Glivenko-Cantelli result suggests that $\sqrt{n}\,(\tilde{\theta}_n - \theta^*)$ is asymptotically normal.


Definition 2.4.2. A sequence of estimates $\tilde{\theta}_n$ is root-n normal if

$$\sqrt{n}\,(\tilde{\theta}_n - \theta^*) \xrightarrow{w} N(0, V^2)$$

for some fixed matrix $V$.

We aim to show that the moment estimates are consistent and asymptotically
root-n normal under very general conditions. We start again with the univariate case.


2.4.1 Root-n Normality: Univariate Parameter 


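Our first result describes the simplest situation when the parameter of interest $\theta^*$
can be represented as an integral $\int g(y)\, dP_{\theta^*}(y)$ for some function $g(\cdot)$.


Theorem 2.4.1. Suppose that $\Theta \subseteq \mathbb{R}$ and a function $g(\cdot): \mathbb{R} \to \mathbb{R}$ satisfies for
every $\theta \in \Theta$

$$\int g(y)\, dP_\theta(y) = \theta, \qquad \int \{g(y) - \theta\}^2\, dP_\theta(y) = \sigma^2(\theta) < \infty.$$

Then the moment estimates $\tilde{\theta}_n = n^{-1}\sum g(Y_i)$ satisfy the following conditions:

1. each $\tilde{\theta}_n$ is unbiased, that is, $\mathbb{E}_{\theta^*}\tilde{\theta}_n = \theta^*$.
2. the normalized quadratic risk $n\mathbb{E}_{\theta^*}(\tilde{\theta}_n - \theta^*)^2$ fulfills

$$n\mathbb{E}_{\theta^*}(\tilde{\theta}_n - \theta^*)^2 = \sigma^2(\theta^*).$$

3. $\tilde{\theta}_n$ is asymptotically root-n normal:

$$\sqrt{n}\,(\tilde{\theta}_n - \theta^*) \xrightarrow{w} N(0, \sigma^2(\theta^*)).$$

This result has already been proved, see Theorem 2.1.3. Next we extend this
result to the more general situation when $\theta^*$ is defined implicitly via the moment
$s_0(\theta^*) = \int g(y)\, dP_{\theta^*}(y)$. This means that there exists another function $m(\theta^*)$ such
that $m(\theta^*) = \int g(y)\, dP_{\theta^*}(y)$.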


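Theorem 2.4.2. Suppose that $\Theta \subseteq \mathbb{R}$ and functions $g(y): \mathbb{R} \to \mathbb{R}$ and $m(\theta): \Theta \to \mathbb{R}$ satisfy

$$\int g(y)\, dP_\theta(y) \equiv m(\theta), \qquad \int \{g(y) - m(\theta)\}^2\, dP_\theta(y) \equiv \sigma_g^2(\theta) < \infty.$$

We also assume that $m(\cdot)$ is monotonic and twice continuously differentiable with
$m'(\theta^*) \ne 0$. Then the moment estimates $\tilde{\theta}_n = m^{-1}\bigl(n^{-1}\sum g(Y_i)\bigr)$ satisfy the
following conditions:

1. $\tilde{\theta}_n$ is consistent, that is, $\tilde{\theta}_n \xrightarrow{P} \theta^*$.
2. $\tilde{\theta}_n$ is asymptotically root-n normal:

$$\sqrt{n}\,(\tilde{\theta}_n - \theta^*) \xrightarrow{w} N(0, \sigma^2(\theta^*)), \qquad (2.3)$$

where $\sigma^2(\theta^*) = |m'(\theta^*)|^{-2}\sigma_g^2(\theta^*)$.


This result also follows directly from Theorem 2.1.3 with $h(s) = m^{-1}(s)$, for which
$h'(m(\theta^*)) = 1/m'(\theta^*)$.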
The property of asymptotic normality allows us to study the asymptotic concentration
of $\tilde{\theta}_n$ and to build asymptotic confidence sets.


Corollary 2.4.1. Let $\tilde{\theta}_n$ be asymptotically root-n normal: see (2.3). Then for any
$z > 0$

$$\lim_{n\to\infty} \mathbb{P}_{\theta^*}\bigl(\sqrt{n}\,|\tilde{\theta}_n - \theta^*| > z\sigma(\theta^*)\bigr) = 2\Phi(-z)$$

where $\Phi(z)$ is the cdf of the standard normal law.

In particular, this result implies that the estimate $\tilde{\theta}_n$ lies outside the small root-n
neighborhood

$$A(z) \stackrel{\text{def}}{=} \bigl[\theta^* - n^{-1/2}\sigma(\theta^*)z,\ \theta^* + n^{-1/2}\sigma(\theta^*)z\bigr]$$

with the probability about $2\Phi(-z)$, which is small provided that $z$ is sufficiently
large.

Next we briefly discuss the problem of interval (or confidence) estimation of
the parameter $\theta^*$. This problem differs from the problem of point estimation: the
target is to build an interval (a set) $E_\alpha$ on the basis of the observations $Y$ such
that $\mathbb{P}(E_\alpha \ni \theta^*) \approx 1 - \alpha$ for a given $\alpha \in (0, 1)$. This problem can be attacked
similarly to the problem of concentration by considering the interval of width
$2n^{-1/2}\sigma(\theta^*)z$ centered at the estimate $\tilde{\theta}$. However, the major difficulty is raised by the fact
that this construction involves the true parameter value $\theta^*$ via the variance $\sigma^2(\theta^*)$.
In some situations this variance does not depend on $\theta^*$: $\sigma^2(\theta^*) \equiv \sigma^2$ with a known
value $\sigma^2$. In this case the construction is immediate.


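Corollary 2.4.2. Let $\tilde{\theta}_n$ be asymptotically root-n normal: see (2.3). Let also
$\sigma^2(\theta^*) \equiv \sigma^2$. Then for any $\alpha \in (0, 1)$, the set

$$E^{\circ}(z_\alpha) \stackrel{\text{def}}{=} \bigl[\tilde{\theta}_n - n^{-1/2}\sigma z_\alpha,\ \tilde{\theta}_n + n^{-1/2}\sigma z_\alpha\bigr],$$

where $z_\alpha$ is defined by $2\Phi(-z_\alpha) = \alpha$, satisfies

$$\lim_{n\to\infty} \mathbb{P}_{\theta^*}\bigl(E^{\circ}(z_\alpha) \ni \theta^*\bigr) = 1 - \alpha. \qquad (2.4)$$

Exercise 2.4.1. Check Corollaries 2.4.1 and 2.4.2.


Next we consider the case when the variance $\sigma^2(\theta^*)$ is unknown. Instead we
assume that a consistent variance estimate $\tilde{\sigma}^2$ is available. Then we plug this
estimate into the construction of the confidence set in place of the unknown true
variance $\sigma^2(\theta^*)$, leading to the following confidence set:

$$E(z_\alpha) \stackrel{\text{def}}{=} \bigl[\tilde{\theta}_n - n^{-1/2}\tilde{\sigma} z_\alpha,\ \tilde{\theta}_n + n^{-1/2}\tilde{\sigma} z_\alpha\bigr]. \qquad (2.5)$$

Theorem 2.4.3. Let $\tilde{\theta}_n$ be asymptotically root-n normal: see (2.3). Let $\sigma(\theta^*) > 0$
and $\tilde{\sigma}^2$ be a consistent estimate of $\sigma^2(\theta^*)$ in the sense that $\tilde{\sigma}^2 \xrightarrow{P} \sigma^2(\theta^*)$. Then for
any $\alpha \in (0, 1)$, the set $E(z_\alpha)$ is asymptotically $\alpha$-confident in the sense of (2.4).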


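One natural estimate of the variance $\sigma(\theta^*)$ can be obtained by plugging in the
estimate $\tilde{\theta}$ in place of $\theta^*$, leading to $\tilde{\sigma} = \sigma(\tilde{\theta})$. If $\sigma(\theta)$ is a continuous function of
$\theta$ in a neighborhood of $\theta^*$, then consistency of $\tilde{\theta}$ implies consistency of $\tilde{\sigma}$.


Corollary 2.4.3. Let $\tilde{\theta}_n$ be asymptotically root-n normal and let the variance $\sigma^2(\theta)$
be a continuous function of $\theta$ at $\theta^*$. Then $\tilde{\sigma} \stackrel{\text{def}}{=} \sigma(\tilde{\theta}_n)$ is a consistent estimate of
$\sigma(\theta^*)$ and the set $E(z_\alpha)$ from (2.5) is asymptotically $\alpha$-confident.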
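The plug-in construction can be sketched in a few lines of Python. This is an illustrative example under assumptions, not the book's code: the function name `plug_in_interval` is hypothetical, and the Poisson family with $g(y) = y$ is chosen because there $\sigma(\theta) = \sqrt{\theta}$ is continuous. The quantile $z_\alpha$ solving $2\Phi(-z_\alpha) = \alpha$ is obtained from the standard-library class `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def plug_in_interval(sample, sigma, alpha):
    """Interval [theta - z*s/sqrt(n), theta + z*s/sqrt(n)] with s = sigma(theta).

    theta is the empirical mean; sigma is a user-supplied function giving the
    asymptotic standard deviation as a function of the parameter.
    """
    n = len(sample)
    theta = sum(sample) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # solves 2 * Phi(-z) = alpha
    half_width = z * sigma(theta) / math.sqrt(n)
    return theta - half_width, theta + half_width

# Poisson counts (assumed data): the variance equals the mean, so sigma(theta) = sqrt(theta).
counts = [3, 5, 4, 6, 2, 4, 5, 3]
lower, upper = plug_in_interval(counts, sigma=math.sqrt, alpha=0.05)
print(lower, upper)
```

With only eight observations the asymptotic coverage $1 - \alpha$ is at best a rough guide. The example shows the mechanics of the construction, not its accuracy.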


2.4.2 Root-n Normality: Multivariate Parameter 


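Let now $\Theta \subseteq \mathbb{R}^p$ and $\theta^*$ be the true parameter vector. The method of moments
requires at least $p$ different moment functions for identifying $p$ parameters. Let
$g(y): \mathbb{R} \to \mathbb{R}^p$ be a vector of moment functions, $g(y) = (g_1(y), \ldots, g_p(y))^\top$.
Suppose first that the true parameter can be obtained just by integration:
$\theta^* = \int g(y)\, dP_{\theta^*}(y)$. This yields the moment estimate $\tilde{\theta}_n = n^{-1}\sum g(Y_i)$.


Theorem 2.4.4. Suppose that a vector-function $g(y): \mathbb{R} \to \mathbb{R}^p$ satisfies the
following conditions:

$$\int g(y)\, dP_\theta(y) = \theta, \qquad \int \{g(y) - \theta\}\{g(y) - \theta\}^\top\, dP_\theta(y) = \Sigma(\theta).$$

Then it holds for the moment estimate $\tilde{\theta}_n = n^{-1}\sum g(Y_i)$:

1. $\tilde{\theta}_n$ is unbiased, that is, $\mathbb{E}_{\theta^*}\tilde{\theta}_n = \theta^*$.
2. $\tilde{\theta}_n$ is asymptotically root-n normal:

$$\sqrt{n}\,(\tilde{\theta}_n - \theta^*) \xrightarrow{w} N(0, \Sigma(\theta^*)). \qquad (2.6)$$

3. the normalized quadratic risk $n\mathbb{E}_{\theta^*}\|\tilde{\theta}_n - \theta^*\|^2$ fulfills

$$n\mathbb{E}_{\theta^*}\|\tilde{\theta}_n - \theta^*\|^2 = \operatorname{tr}\Sigma(\theta^*).$$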


Similarly to the univariate case, this result yields corollaries about concentration
and confidence sets with intervals replaced by ellipsoids. Indeed, due to the second
statement, the vector

$$\xi_n = \sqrt{n}\,\{\Sigma(\theta^*)\}^{-1/2}(\tilde{\theta} - \theta^*)$$

is asymptotically standard normal: $\xi_n \xrightarrow{w} \xi \sim N(0, I_p)$. This also implies that the
squared norm of $\xi_n$ is asymptotically $\chi_p^2$-distributed where $\chi_p^2$ is the law of
$\|\xi\|^2 = \xi_1^2 + \ldots + \xi_p^2$. Define the value $z_\alpha$ via the quantiles of $\chi_p^2$ by the relation

$$\mathbb{P}\bigl(\|\xi\| > z_\alpha\bigr) = \alpha. \qquad (2.7)$$


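Corollary 2.4.4. Suppose that $\tilde{\theta}_n$ is root-n normal, see (2.6). Define for a given $z$
the ellipsoid

$$A(z) \stackrel{\text{def}}{=} \bigl\{\theta : (\theta - \theta^*)^\top \{\Sigma(\theta^*)\}^{-1}(\theta - \theta^*) \le z^2/n\bigr\}.$$

Then $A(z_\alpha)$ is an asymptotic $(1 - \alpha)$-concentration set for $\tilde{\theta}_n$ in the sense that

$$\lim_{n\to\infty} \mathbb{P}\bigl(\tilde{\theta} \notin A(z_\alpha)\bigr) = \alpha.$$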


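The weak convergence $\xi_n \xrightarrow{w} \xi$ suggests building confidence sets also in the form of
ellipsoids with the axes defined by the covariance matrix $\Sigma(\theta^*)$. Define for $\alpha > 0$

$$E^{\circ}(z_\alpha) \stackrel{\text{def}}{=} \bigl\{\theta : \sqrt{n}\,\bigl\|\{\Sigma(\theta^*)\}^{-1/2}(\tilde{\theta} - \theta)\bigr\| \le z_\alpha\bigr\}.$$

The result of Theorem 2.4.4 implies that this set covers the true value $\theta^*$ with
probability approaching $1 - \alpha$.

Unfortunately, in typical situations the matrix $\Sigma(\theta^*)$ is unknown because it
depends on the unknown parameter $\theta^*$. It is natural to replace it with the matrix
$\Sigma(\tilde{\theta})$, replacing the true value $\theta^*$ with its consistent estimate $\tilde{\theta}$. If $\Sigma(\theta)$ is a
continuous function of $\theta$, then $\Sigma(\tilde{\theta})$ provides a consistent estimate of $\Sigma(\theta^*)$. This
leads to the data-driven confidence set:

$$E(z_\alpha) \stackrel{\text{def}}{=} \bigl\{\theta : \sqrt{n}\,\bigl\|\{\Sigma(\tilde{\theta})\}^{-1/2}(\tilde{\theta} - \theta)\bigr\| \le z_\alpha\bigr\}.$$

Corollary 2.4.5. Suppose that $\tilde{\theta}_n$ is root-n normal, see (2.6), with a non-degenerate
matrix $\Sigma(\theta^*)$. Let the matrix function $\Sigma(\theta)$ be continuous at $\theta^*$. Let $z_\alpha$ be
defined by (2.7). Then $E^{\circ}(z_\alpha)$ and $E(z_\alpha)$ are asymptotic $(1 - \alpha)$-confidence sets
for $\theta^*$:

$$\lim_{n\to\infty} \mathbb{P}\bigl(E^{\circ}(z_\alpha) \ni \theta^*\bigr) = \lim_{n\to\infty} \mathbb{P}\bigl(E(z_\alpha) \ni \theta^*\bigr) = 1 - \alpha.$$

Exercise 2.4.2. Check Corollaries 2.4.4 and 2.4.5 about the set $E^{\circ}(z_\alpha)$.


Exercise 2.4.3. Check Corollary 2.4.5 about the set $E(z_\alpha)$.
Hint: $\tilde{\theta}$ is consistent and $\Sigma(\theta)$ is continuous and invertible at $\theta^*$. This implies

$$\Sigma(\tilde{\theta}) - \Sigma(\theta^*) \xrightarrow{P} 0, \qquad \{\Sigma(\tilde{\theta})\}^{-1} - \{\Sigma(\theta^*)\}^{-1} \xrightarrow{P} 0,$$

and hence, the sets $E^{\circ}(z_\alpha)$ and $E(z_\alpha)$ are nearly the same.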


Finally we discuss the general situation when the target parameter is a function
of the moments. This means the relations

$$m(\theta) = \int g(y)\, dP_\theta(y), \qquad \theta = m^{-1}\bigl(m(\theta)\bigr).$$

Of course, these relations assume that the vector function $m(\cdot)$ is invertible. The
substitution principle leads to the estimate

$$\tilde{\theta} \stackrel{\text{def}}{=} m^{-1}(M_n),$$

where $M_n$ is the vector of empirical moments:

$$M_n \stackrel{\text{def}}{=} \int g(y)\, dP_n(y) = \frac{1}{n}\sum g(Y_i).$$

The central limit theorem implies (see Theorem 2.1.4) that $M_n$ is a consistent
estimate of $m(\theta^*)$ and the vector $\sqrt{n}\,[M_n - m(\theta^*)]$ is asymptotically normal with
some covariance matrix $\Sigma_g(\theta^*)$. Moreover, if $m^{-1}$ is differentiable at the point
$m(\theta^*)$, then $\sqrt{n}\,(\tilde{\theta} - \theta^*)$ is asymptotically normal as well:

$$\sqrt{n}\,(\tilde{\theta} - \theta^*) \xrightarrow{w} N(0, \Sigma(\theta^*))$$

where $\Sigma(\theta^*) = H^\top \Sigma_g(\theta^*) H$ and $H$ is the $p \times p$ Jacobi matrix of $m^{-1}$ at $m(\theta^*)$:

$$H \stackrel{\text{def}}{=} \frac{d}{d\theta}\, m^{-1}\bigl(m(\theta^*)\bigr).$$


2.5 Some Geometric Properties of a Parametric Family


The parametric situation means that the true marginal distribution $P$ belongs to
some given parametric family $(P_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$. By $\theta^*$ we denote the true value,
that is, $P = P_{\theta^*} \in (P_\theta)$. The natural target of estimation in this situation is the
parameter $\theta^*$ itself. Below we assume that the family $(P_\theta)$ is dominated, that is,
there exists a dominating measure $\mu_0$. The corresponding density is denoted by

$$p(y, \theta) = \frac{dP_\theta}{d\mu_0}(y).$$

We also use the notation

$$\ell(y, \theta) \stackrel{\text{def}}{=} \log p(y, \theta)$$

for the log-density.

The following two important characteristics of the parametric family $(P_\theta)$ will
be frequently used in the sequel: the Kullback-Leibler divergence and Fisher
information.


2.5.1 Kullback-Leibler Divergence


For any two parameters $\theta, \theta'$, the value

$$K(P_\theta, P_{\theta'}) = \int \log\frac{p(y, \theta)}{p(y, \theta')}\, p(y, \theta)\, d\mu_0(y) = \int \bigl[\ell(y, \theta) - \ell(y, \theta')\bigr]\, p(y, \theta)\, d\mu_0(y)$$

is called the Kullback-Leibler divergence (KL-divergence) between $P_\theta$ and $P_{\theta'}$.
We also write $K(\theta, \theta')$ instead of $K(P_\theta, P_{\theta'})$ if there is no risk of confusion.
Equivalently one can represent the KL-divergence as

$$K(\theta, \theta') = \mathbb{E}_\theta \log\frac{p(Y, \theta)}{p(Y, \theta')} = \mathbb{E}_\theta\bigl[\ell(Y, \theta) - \ell(Y, \theta')\bigr],$$

where $Y \sim P_\theta$. An important feature of the Kullback-Leibler divergence is that it
is always non-negative and it is equal to zero iff the measures $P_\theta$ and $P_{\theta'}$ coincide.


Lemma 2.5.1. For any $\theta, \theta'$, it holds

$$K(\theta, \theta') \ge 0.$$

Moreover, $K(\theta, \theta') = 0$ implies that the densities $p(y, \theta)$ and $p(y, \theta')$ coincide
$\mu_0$-a.s.

Proof. Define $Z(y) = p(y, \theta')/p(y, \theta)$. Then

$$\int Z(y)\, p(y, \theta)\, d\mu_0(y) = \int p(y, \theta')\, d\mu_0(y) = 1$$

because $p(y, \theta')$ is the density of $P_{\theta'}$ w.r.t. $\mu_0$. Next, $\frac{d^2}{dt^2}\log(t) = -t^{-2} < 0$, thus,
the log-function is strictly concave. The Jensen inequality implies

$$K(\theta, \theta') = -\int \log\bigl(Z(y)\bigr)\, p(y, \theta)\, d\mu_0(y) \ge -\log\Bigl(\int Z(y)\, p(y, \theta)\, d\mu_0(y)\Bigr) = -\log(1) = 0.$$

Moreover, the strict concavity of the log-function implies that the equality in this
relation is only possible if $Z(y) \equiv 1$ $P_\theta$-a.s. This implies the last statement of the
lemma.


The two mentioned features of the Kullback-Leibler divergence suggest considering
it as a kind of distance on the parameter space. In some sense, it measures
how far $P_{\theta'}$ is from $P_\theta$. Unfortunately, it is not a metric because it is not symmetric:

$$K(\theta, \theta') \ne K(\theta', \theta)$$

with very few exceptions for some special situations.
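The lack of symmetry is easy to see on a discrete example. The sketch below is an illustration, not part of the text: it evaluates the KL-divergence for two Bernoulli laws by direct summation and also checks numerically that the divergence of a two-fold product family is the sum of the marginal divergences. The helper names `kl`, `bernoulli`, and `product_pmf`, and the parameter values 0.2 and 0.5, are assumptions.

```python
import math

def kl(p, q):
    """K(P, Q) = sum_y p(y) * log(p(y) / q(y)) for pmfs on a common finite support."""
    return sum(py * math.log(py / q[y]) for y, py in p.items() if py > 0)

def bernoulli(theta):
    """Probability mass function of a Bernoulli(theta) law as a dict."""
    return {0: 1.0 - theta, 1: theta}

def product_pmf(p1, p2):
    """Joint pmf of two independent coordinates with marginals p1 and p2."""
    return {(y1, y2): p1[y1] * p2[y2] for y1 in p1 for y2 in p2}

a, b = bernoulli(0.2), bernoulli(0.5)
k_ab, k_ba = kl(a, b), kl(b, a)
print(k_ab, k_ba)                                        # the two values differ
print(kl(product_pmf(a, a), product_pmf(b, b)), 2 * k_ab)  # product vs. sum of marginals
```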


Exercise 2.5.1. Compute KL-divergence for the Gaussian shift, Bernoulli, Poisson, 
volatility, and exponential families. Check in which cases it is symmetric. 


Exercise 2.5.2. Consider the shift experiment given by the equation $Y = \theta + \varepsilon$
where $\varepsilon$ is an error with the given density function $p(\cdot)$ on $\mathbb{R}$. Compute the KL-divergence
and check for symmetry.


One more important feature of the KL-divergence is its additivity. 


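Lemma 2.5.2. Let $(P_\theta^{(1)}, \theta \in \Theta)$ and $(P_\theta^{(2)}, \theta \in \Theta)$ be two parametric families
with the same parameter set $\Theta$, and let $(P_\theta = P_\theta^{(1)} \times P_\theta^{(2)}, \theta \in \Theta)$ be the product
family. Then for any $\theta, \theta' \in \Theta$

$$K(P_\theta, P_{\theta'}) = K\bigl(P_\theta^{(1)}, P_{\theta'}^{(1)}\bigr) + K\bigl(P_\theta^{(2)}, P_{\theta'}^{(2)}\bigr).$$

Exercise 2.5.3. Prove Lemma 2.5.2. Extend the result to the case of the $m$-fold
product of measures.

Hint: use that the log-density $\ell(y_1, y_2, \theta)$ of the product measure $P_\theta$ fulfills
$\ell(y_1, y_2, \theta) = \ell^{(1)}(y_1, \theta) + \ell^{(2)}(y_2, \theta)$.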


The additivity of the KL-divergence helps to easily compute the KL quantity for
two measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ describing the i.i.d. sample $Y = (Y_1, \ldots, Y_n)^\top$. The
log-density of the measure $\mathbb{P}_\theta$ w.r.t. $\boldsymbol{\mu}_0 = \mu_0^{\otimes n}$ at the point $y = (y_1, \ldots, y_n)^\top$ is given
by

$$L(y, \theta) \stackrel{\text{def}}{=} \sum \ell(y_i, \theta).$$

An extension of the result of Lemma 2.5.2 yields

$$K(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \stackrel{\text{def}}{=} \mathbb{E}_\theta\{L(Y, \theta) - L(Y, \theta')\} = nK(\theta, \theta').$$


2.5.2 Hellinger Distance 


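Another useful characteristic of a parametric family $(P_\theta)$ is the so-called Hellinger
distance. For a fixed $\mu \in [0, 1]$ and any $\theta, \theta' \in \Theta$, define

$$h(\mu, P_\theta, P_{\theta'}) = \mathbb{E}_\theta\Bigl(\frac{dP_{\theta'}}{dP_\theta}(Y)\Bigr)^{\mu} = \int \Bigl(\frac{p(y, \theta')}{p(y, \theta)}\Bigr)^{\mu} dP_\theta(y) = \int p^{\mu}(y, \theta')\, p^{1-\mu}(y, \theta)\, d\mu_0(y).$$

Note that this function can be represented as an exponential moment of the
log-likelihood ratio $\ell(Y, \theta, \theta') = \ell(Y, \theta) - \ell(Y, \theta')$:

$$h(\mu, P_\theta, P_{\theta'}) = \mathbb{E}_\theta \exp\{\mu\,\ell(Y, \theta', \theta)\} = \mathbb{E}_\theta\Bigl(\frac{dP_{\theta'}}{dP_\theta}(Y)\Bigr)^{\mu}.$$

It is obvious that $h(\mu, P_\theta, P_{\theta'}) \ge 0$. Moreover, $h(\mu, P_\theta, P_{\theta'}) \le 1$. Indeed, the
function $x^{\mu}$ for $\mu \in [0, 1]$ is concave and by the Jensen inequality:

$$\mathbb{E}_\theta\Bigl(\frac{dP_{\theta'}}{dP_\theta}(Y)\Bigr)^{\mu} \le \Bigl(\mathbb{E}_\theta \frac{dP_{\theta'}}{dP_\theta}(Y)\Bigr)^{\mu} = 1.$$

Similarly to the Kullback-Leibler divergence, we often write $h(\mu, \theta, \theta')$ in place of
$h(\mu, P_\theta, P_{\theta'})$.

Typically the Hellinger distance is considered for $\mu = 1/2$. Then

$$h(1/2, \theta, \theta') = \int p^{1/2}(y, \theta')\, p^{1/2}(y, \theta)\, d\mu_0(y).$$

In contrast to the Kullback-Leibler divergence, this quantity is symmetric and can
be used to define a metric on the parameter set $\Theta$.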
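Introduce the rate function

$$m(\mu, \theta, \theta') = -\log h(\mu, \theta, \theta') = -\log \mathbb{E}_\theta \exp\{\mu\,\ell(Y, \theta', \theta)\}.$$

The property $h(\mu, \theta, \theta') \le 1$ implies $m(\mu, \theta, \theta') \ge 0$.
The rate function, similarly to the KL-divergence, is additive.

Lemma 2.5.3. Let $(P_\theta^{(1)}, \theta \in \Theta)$ and $(P_\theta^{(2)}, \theta \in \Theta)$ be two parametric families
with the same parameter set $\Theta$, and let $(P_\theta = P_\theta^{(1)} \times P_\theta^{(2)}, \theta \in \Theta)$ be the product
family. Then for any $\theta, \theta' \in \Theta$ and any $\mu \in [0, 1]$

$$m(\mu, P_\theta, P_{\theta'}) = m\bigl(\mu, P_\theta^{(1)}, P_{\theta'}^{(1)}\bigr) + m\bigl(\mu, P_\theta^{(2)}, P_{\theta'}^{(2)}\bigr).$$

Exercise 2.5.4. Prove Lemma 2.5.3. Extend the result to the case of an $m$-fold
product of measures.

Hint: use that the log-density $\ell(y_1, y_2, \theta)$ of the product measure $P_\theta$ fulfills
$\ell(y_1, y_2, \theta) = \ell^{(1)}(y_1, \theta) + \ell^{(2)}(y_2, \theta)$.


Application of this lemma to the i.i.d. product family yields

$$M(\mu, \theta', \theta) \stackrel{\text{def}}{=} -\log \mathbb{E}_{\theta'} \exp\{\mu L(Y, \theta, \theta')\} = n\, m(\mu, \theta', \theta).$$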
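For a family on a finite set the quantity $h(\mu, \theta, \theta')$ is a finite sum, so its basic properties can be checked directly. The sketch below is an illustration under assumptions: the Bernoulli family, the parameter values, and the helper names are chosen for the example. It checks symmetry for $\mu = 1/2$, the bound $h \le 1$, and additivity of the rate function for a two-fold product.

```python
import math

def bernoulli(theta):
    """Probability mass function of a Bernoulli(theta) law as a dict."""
    return {0: 1.0 - theta, 1: theta}

def hellinger_affinity(mu, p, p_prime):
    """h(mu, P, P') = sum_y p'(y)^mu * p(y)^(1 - mu) on a finite support."""
    return sum(p_prime[y] ** mu * p[y] ** (1.0 - mu) for y in p)

def rate(mu, p, p_prime):
    """m(mu, P, P') = -log h(mu, P, P')."""
    return -math.log(hellinger_affinity(mu, p, p_prime))

def product_pmf(p1, p2):
    """Joint pmf of two independent coordinates with marginals p1 and p2."""
    return {(y1, y2): p1[y1] * p2[y2] for y1 in p1 for y2 in p2}

a, b = bernoulli(0.2), bernoulli(0.5)
h_half = hellinger_affinity(0.5, a, b)
print(h_half, hellinger_affinity(0.5, b, a))          # equal: symmetric for mu = 1/2
print(hellinger_affinity(0.3, a, b), hellinger_affinity(0.3, b, a))  # differ in general
print(rate(0.5, product_pmf(a, a), product_pmf(b, b)), 2 * rate(0.5, a, b))
```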


2.5.3 Regularity and the Fisher Information: Univariate 
Parameter 


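An important assumption on the considered parametric family $(P_\theta)$ is that the
corresponding density function $p(y, \theta)$ is absolutely continuous w.r.t. the parameter
$\theta$ for almost all $y$. Then the log-density $\ell(y, \theta)$ is differentiable as well with

$$\nabla\ell(y, \theta) \stackrel{\text{def}}{=} \frac{\partial \ell(y, \theta)}{\partial\theta} = \frac{1}{p(y, \theta)}\,\frac{\partial p(y, \theta)}{\partial\theta}$$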

with the convention $\frac{1}{0}\log(0) = 0$. In the case of a univariate parameter $\theta \in \mathbb{R}$, we
also write $\ell'(y, \theta)$ instead of $\nabla\ell(y, \theta)$.

Moreover, we usually assume some regularity conditions on the density $p(y, \theta)$.
The next definition presents one possible set of such conditions for the case of a
univariate parameter $\theta$.


Definition 2.5.1. The family $(P_\theta, \theta \in \Theta \subseteq \mathbb{R})$ is regular if the following conditions
are fulfilled:

1. The sets $A(\theta) \stackrel{\text{def}}{=} \{y : p(y, \theta) = 0\}$ are the same for all $\theta \in \Theta$.
2. Differentiability under the integration sign: for any function $s(y)$ satisfying

$$\int s^2(y)\, p(y, \theta)\, d\mu_0(y) \le C, \qquad \theta \in \Theta,$$

it holds

$$\frac{\partial}{\partial\theta}\int s(y)\, dP_\theta(y) = \frac{\partial}{\partial\theta}\int s(y)\, p(y, \theta)\, d\mu_0(y) = \int s(y)\,\frac{\partial p(y, \theta)}{\partial\theta}\, d\mu_0(y).$$

3. Finite Fisher information: the log-density function $\ell(y, \theta)$ is differentiable in $\theta$
and its derivative is square integrable w.r.t. $P_\theta$:

$$\int \bigl|\ell'(y, \theta)\bigr|^2\, dP_\theta(y) = \int \frac{|p'(y, \theta)|^2}{p(y, \theta)}\, d\mu_0(y) < \infty. \qquad (2.8)$$

The quantity in the condition (2.8) plays an important role in asymptotic
statistics. It is usually referred to as the Fisher information.


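Definition 2.5.2. Let $(P_\theta, \theta \in \Theta \subseteq \mathbb{R})$ be a regular parametric family with the
univariate parameter. Then the quantity

$$
F(\theta) \stackrel{\text{def}}{=} \int |\ell'(y, \theta)|^2\, p(y, \theta)\, d\mu_0(y) = \int \frac{|p'(y, \theta)|^2}{p(y, \theta)}\, d\mu_0(y)
$$

is called the Fisher information of $(P_\theta)$ at $\theta \in \Theta$.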


The definition of $F(\theta)$ can be written as

$$
F(\theta) = E_\theta |\ell'(Y, \theta)|^2
$$

with $Y \sim P_\theta$.
A simple sufficient condition for regularity of a family $(P_\theta)$ is given by the next
lemma.


Lemma 2.5.4. Let the log-density $\ell(y, \theta) = \log p(y, \theta)$ of a dominated family
$(P_\theta)$ be differentiable in $\theta$ and let the Fisher information $F(\theta)$ be a continuous
function on $\Theta$. Then $(P_\theta)$ is regular.


The proof is technical and can be found, e.g., in Borovkov (1998). Some useful
properties of the regular families are listed in the next lemma.


Lemma 2.5.5. Let $(P_\theta)$ be a regular family. Then for any $\theta \in \Theta$ and $Y \sim P_\theta$:
1. $E_\theta \ell'(Y, \theta) = \int \ell'(y, \theta)\, p(y, \theta)\, d\mu_0(y) = 0$ and $F(\theta) = \mathrm{Var}_\theta[\ell'(Y, \theta)]$.
2. $F(\theta) = -E_\theta \ell''(Y, \theta) = -\int \ell''(y, \theta)\, p(y, \theta)\, d\mu_0(y)$.


Proof. Differentiating the identity $\int p(y, \theta)\, d\mu_0(y) = \int \exp\{\ell(y, \theta)\}\, d\mu_0(y) \equiv 1$
implies under the regularity conditions the first statement of the lemma. Differentiating
once more yields the second statement with another representation of the
Fisher information.
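
The two statements of the lemma can be checked numerically on a concrete family. The following is a minimal sketch, assuming the Poisson family with $\ell(k, \theta) = k \log\theta - \theta - \log(k!)$; the function names and the truncation point `k_max` are illustrative choices and not part of the text. For this family the three printed numbers should be close to $0$, $1/\theta$ and $1/\theta$.

```python
import math

def poisson_pmf(k, theta):
    # p(k, theta) = theta^k e^{-theta} / k!, evaluated on the log scale
    return math.exp(k * math.log(theta) - theta - math.lgamma(k + 1))

def score(k, theta):
    # first derivative in theta of log p(k, theta) = k log(theta) - theta - log(k!)
    return k / theta - 1.0

def second_derivative(k, theta):
    # second derivative in theta of log p(k, theta)
    return -k / theta ** 2

def fisher_checks(theta, k_max=200):
    # sums over k = 0..k_max; the neglected tail mass is negligible for moderate theta
    ks = range(k_max + 1)
    mean_score = sum(score(k, theta) * poisson_pmf(k, theta) for k in ks)
    var_score = sum((score(k, theta) - mean_score) ** 2 * poisson_pmf(k, theta) for k in ks)
    minus_mean_second = -sum(second_derivative(k, theta) * poisson_pmf(k, theta) for k in ks)
    return mean_score, var_score, minus_mean_second

mean_score, var_score, minus_mean_second = fisher_checks(theta=2.5)
print(mean_score, var_score, minus_mean_second)
```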


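Like the KL-divergence, the Fisher information possesses the important additivity property.

Lemma 2.5.6. Let $(P_\theta^{(1)}, \theta \in \Theta)$ and $(P_\theta^{(2)}, \theta \in \Theta)$ be two parametric families
with the same parameter set $\Theta$, and let $(P_\theta = P_\theta^{(1)} \times P_\theta^{(2)},\ \theta \in \Theta)$ be the product
family. Then for any $\theta \in \Theta$, the Fisher information $F(\theta)$ satisfies

$$
F(\theta) = F^{(1)}(\theta) + F^{(2)}(\theta)
$$

where $F^{(1)}(\theta)$ (resp. $F^{(2)}(\theta)$) is the Fisher information for $(P_\theta^{(1)})$ (resp. for $(P_\theta^{(2)})$).

Exercise 2.5.5. Prove Lemma 2.5.6.

Hint: use that the log-density of the product experiment can be represented as
$\ell(y_1, y_2, \theta) = \ell_1(y_1, \theta) + \ell_2(y_2, \theta)$. The independence of $Y_1$ and $Y_2$ implies

$$
F(\theta) = \mathrm{Var}_\theta[\ell'(Y_1, Y_2, \theta)] = \mathrm{Var}_\theta[\ell_1'(Y_1, \theta) + \ell_2'(Y_2, \theta)]
= \mathrm{Var}_\theta[\ell_1'(Y_1, \theta)] + \mathrm{Var}_\theta[\ell_2'(Y_2, \theta)].
$$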

Exercise 2.5.6. Compute the Fisher information for the Gaussian shift, Bernoulli, 
Poisson, volatility, and exponential families. Check in which cases it is constant. 


Exercise 2.5.7. Consider the shift experiment given by the equation $Y = \theta + \varepsilon$
where $\varepsilon$ is an error with the given density function $p(\cdot)$ on $\mathbb{R}$. Compute the Fisher
information and check whether it is constant.


Exercise 2.5.8. Check that the i.i.d. experiment from the uniform distribution on
the interval $[0, \theta]$ with unknown $\theta$ is not regular.


Now we consider the properties of the i.i.d. experiment from a given regular
family $(P_\theta)$. The distribution of the whole i.i.d. sample $Y$ is described by the
product measure $\mathbb{P}_\theta = P_\theta^{\otimes n}$ which is dominated by the measure $\boldsymbol{\mu}_0 = \mu_0^{\otimes n}$. The
corresponding log-density $L(y, \theta)$ is given by

$$
L(y, \theta) \stackrel{\text{def}}{=} \log \frac{d\mathbb{P}_\theta}{d\boldsymbol{\mu}_0}(y) = \sum_i \ell(y_i, \theta).
$$

The function $\exp L(y, \theta)$ is the density of $\mathbb{P}_\theta$ w.r.t. $\boldsymbol{\mu}_0$ and hence, for any r.v. $\xi$

$$
E_\theta \xi = E_0[\xi \exp L(Y, \theta)].
$$

In particular, for $\xi \equiv 1$, this formula leads to the identity

$$
E_0[\exp L(Y, \theta)] = \int \exp\{L(y, \theta)\}\, \boldsymbol{\mu}_0(dy) \equiv 1. \qquad (2.9)
$$

The next lemma claims that the product family $(\mathbb{P}_\theta)$ for an i.i.d. sample from a
regular family is also regular.


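Lemma 2.5.7. Let $(P_\theta)$ be a regular family and $\mathbb{P}_\theta = P_\theta^{\otimes n}$. Then

1. The set $A_n \stackrel{\text{def}}{=} \{y = (y_1, \ldots, y_n)^\top : \prod_i p(y_i, \theta) = 0\}$ is the same for all $\theta \in \Theta$.
2. For any r.v. $S = S(Y)$ with $E_\theta S^2 \le C$, $\theta \in \Theta$, it holds

$$
\frac{\partial}{\partial \theta} E_\theta S = \frac{\partial}{\partial \theta} E_0[S \exp L(Y, \theta)] = E_0[S\, L'(Y, \theta) \exp L(Y, \theta)],
$$

where $L'(Y, \theta) \stackrel{\text{def}}{=} \frac{\partial}{\partial \theta} L(Y, \theta)$.

3. The derivative $L'(Y, \theta)$ is square integrable and

$$
E_\theta |L'(Y, \theta)|^2 = n F(\theta).
$$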


2.5.4 Local Properties of the Kullback—Leibler Divergence and 
Hellinger Distance 


Here we show that the quantities introduced so far are closely related to each other. 
We start with the Kullback—Leibler divergence. 


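Lemma 2.5.8. Let $(P_\theta)$ be a regular family. Then the KL-divergence $K(\theta, \theta')$
satisfies:

$$
K(\theta, \theta')\Big|_{\theta' = \theta} = 0, \qquad
\frac{d}{d\theta'} K(\theta, \theta')\Big|_{\theta' = \theta} = 0, \qquad
\frac{d^2}{d\theta'^2} K(\theta, \theta')\Big|_{\theta' = \theta} = F(\theta).
$$

In a small neighborhood of $\theta$, the KL-divergence can be approximated by

$$
K(\theta, \theta') \approx F(\theta)\, |\theta' - \theta|^2 / 2.
$$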
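
The quality of this quadratic approximation is easy to inspect numerically. The sketch below assumes the Bernoulli family, for which $K(\theta, \theta') = \theta \log(\theta/\theta') + (1 - \theta) \log\{(1 - \theta)/(1 - \theta')\}$ and $F(\theta) = 1/\{\theta(1 - \theta)\}$; the helper names are illustrative. The ratio of the exact divergence to its approximation should approach one as $\theta' \to \theta$.

```python
import math

def kl_bernoulli(theta, theta_prime):
    # Kullback-Leibler divergence between Bernoulli(theta) and Bernoulli(theta')
    return (theta * math.log(theta / theta_prime)
            + (1 - theta) * math.log((1 - theta) / (1 - theta_prime)))

def fisher_bernoulli(theta):
    # Fisher information of the Bernoulli family
    return 1.0 / (theta * (1.0 - theta))

def approximation_ratio(theta, delta):
    # exact K(theta, theta + delta) divided by F(theta) * delta^2 / 2
    exact = kl_bernoulli(theta, theta + delta)
    approx = fisher_bernoulli(theta) * delta ** 2 / 2
    return exact / approx

for delta in (0.1, 0.01, 0.001):
    print(delta, approximation_ratio(0.3, delta))
```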


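Similar properties can be established for the rate function $m(\mu, \theta, \theta')$.


Lemma 2.5.9. Let $(P_\theta)$ be a regular family. Then the rate function $m(\mu, \theta, \theta')$
satisfies:

$$
m(\mu, \theta, \theta')\Big|_{\theta' = \theta} = 0, \qquad
\frac{d}{d\theta'} m(\mu, \theta, \theta')\Big|_{\theta' = \theta} = 0, \qquad
\frac{d^2}{d\theta'^2} m(\mu, \theta, \theta')\Big|_{\theta' = \theta} = \mu(1 - \mu) F(\theta).
$$

In a small neighborhood of $\theta$, the rate function $m(\mu, \theta, \theta')$ can be approximated by

$$
m(\mu, \theta, \theta') \approx \mu(1 - \mu) F(\theta)\, |\theta' - \theta|^2 / 2.
$$

Moreover, for any $\theta, \theta' \in \Theta$

$$
m(\mu, \theta, \theta')\Big|_{\mu = 0} = 0, \qquad
\frac{d}{d\mu} m(\mu, \theta, \theta')\Big|_{\mu = 0} = E_\theta \ell(Y, \theta, \theta') = K(\theta, \theta'), \qquad
\frac{d^2}{d\mu^2} m(\mu, \theta, \theta')\Big|_{\mu = 0} = -\mathrm{Var}_\theta[\ell(Y, \theta, \theta')].
$$

This implies an approximation for $\mu$ small:

$$
m(\mu, \theta, \theta') \approx \mu K(\theta, \theta') - \frac{\mu^2}{2} \mathrm{Var}_\theta[\ell(Y, \theta, \theta')].
$$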
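
Both approximations of the lemma can be examined on a two-point example. The sketch below is illustrative only: it assumes the Bernoulli family and takes the rate function in the form $m(\mu, \theta, \theta') = -\log E_\theta \exp\{\mu\, \ell(Y, \theta', \theta)\}$ with $\ell(y, \theta', \theta) = \ell(y, \theta') - \ell(y, \theta)$, which is the form consistent with the derivative $K(\theta, \theta')$ at $\mu = 0$ stated above; this definition and all function names are assumptions of the example.

```python
import math

def rate_function(mu, theta, theta_prime):
    # assumed form: m = -log sum_y p(y, theta)^(1 - mu) * p(y, theta')^mu, y in {0, 1}
    total = (theta ** (1 - mu) * theta_prime ** mu
             + (1 - theta) ** (1 - mu) * (1 - theta_prime) ** mu)
    return -math.log(total)

def kl_bernoulli(theta, theta_prime):
    return (theta * math.log(theta / theta_prime)
            + (1 - theta) * math.log((1 - theta) / (1 - theta_prime)))

def fisher_bernoulli(theta):
    return 1.0 / (theta * (1.0 - theta))

theta, mu, delta = 0.3, 0.3, 0.001

# local approximation in theta': m ~ mu (1 - mu) F(theta) delta^2 / 2
local_ratio = (rate_function(mu, theta, theta + delta)
               / (mu * (1 - mu) * fisher_bernoulli(theta) * delta ** 2 / 2))

# slope in mu at mu = 0 compared with K(theta, theta')
small_mu = 1e-6
slope_ratio = (rate_function(small_mu, theta, 0.5) / small_mu) / kl_bernoulli(theta, 0.5)

print(local_ratio, slope_ratio)
```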


Exercise 2.5.9. Check the statements of Lemmas 2.5.8 and 2.5.9. 


2.6 Cramér—Rao Inequality 


Let $\tilde\theta$ be an estimate of the parameter $\theta^*$. We are interested in establishing a lower
bound for the risk of this estimate. This bound indicates that under some conditions
the quadratic risk of this estimate can never be below a specific value.


2.6.1 Univariate Parameter


We again start with the univariate case and consider the case of an unbiased estimate
$\tilde\theta$. Suppose that the family $(P_\theta, \theta \in \Theta)$ is dominated by a $\sigma$-finite measure $\mu_0$ on
the real line and denote by $p(y, \theta)$ the density of $P_\theta$ w.r.t. $\mu_0$:

$$
p(y, \theta) \stackrel{\text{def}}{=} \frac{dP_\theta}{d\mu_0}(y).
$$


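Theorem 2.6.1 (Cramér-Rao Inequality). Let $\tilde\theta = \tilde\theta(Y)$ be an unbiased estimate
of $\theta$ for an i.i.d. sample from a regular family $(P_\theta)$. Then

$$
E_\theta |\tilde\theta - \theta|^2 = \mathrm{Var}_\theta(\tilde\theta) \ge \frac{1}{n F(\theta)}
$$

with the equality iff $\tilde\theta - \theta = \{n F(\theta)\}^{-1} L'(Y, \theta)$ almost surely. Moreover, if $\tilde\theta$ is not
unbiased and $\tau(\theta) = E_\theta \tilde\theta$, then with $\tau'(\theta) \stackrel{\text{def}}{=} \frac{d}{d\theta} \tau(\theta)$, it holds

$$
\mathrm{Var}_\theta(\tilde\theta) \ge \frac{|\tau'(\theta)|^2}{n F(\theta)}
$$

and

$$
E_\theta |\tilde\theta - \theta|^2 = \mathrm{Var}_\theta(\tilde\theta) + |\tau(\theta) - \theta|^2 \ge \frac{|\tau'(\theta)|^2}{n F(\theta)} + |\tau(\theta) - \theta|^2.
$$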


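Proof. Consider first the case of an unbiased estimate $\tilde\theta$ with $E_\theta \tilde\theta \equiv \theta$. Differentiating
the identity (2.9) $E_0 \exp L(Y, \theta) \equiv 1$ w.r.t. $\theta$ yields

$$
0 \equiv \int L'(y, \theta) \exp\{L(y, \theta)\}\, \boldsymbol{\mu}_0(dy) = E_\theta L'(Y, \theta). \qquad (2.10)
$$

Similarly, the identity $E_\theta \tilde\theta = \theta$ implies

$$
1 \equiv \int \tilde\theta\, L'(y, \theta) \exp\{L(y, \theta)\}\, \boldsymbol{\mu}_0(dy) = E_\theta[\tilde\theta\, L'(Y, \theta)].
$$

Together with (2.10), this gives

$$
E_\theta[(\tilde\theta - \theta) L'(Y, \theta)] \equiv 1. \qquad (2.11)
$$

Define $h = \{n F(\theta)\}^{-1} L'(Y, \theta)$. Then $E_\theta\{h L'(Y, \theta)\} = 1$ and (2.11) yields

$$
E_\theta[(\tilde\theta - \theta - h) h] \equiv 0.
$$

Now

$$
E_\theta(\tilde\theta - \theta)^2 = E_\theta(\tilde\theta - \theta - h + h)^2 = E_\theta(\tilde\theta - \theta - h)^2 + E_\theta h^2
= E_\theta(\tilde\theta - \theta - h)^2 + \{n F(\theta)\}^{-1} \ge \{n F(\theta)\}^{-1}
$$

with the equality iff $\tilde\theta - \theta = \{n F(\theta)\}^{-1} L'(Y, \theta)$ almost surely. This implies the
first assertion.

Now we consider the general case. The proof is similar. The property (2.10)
continues to hold. Next, the identity $E_\theta \tilde\theta = \theta$ is replaced with $E_\theta \tilde\theta = \tau(\theta)$ yielding

$$
E_\theta[\tilde\theta\, L'(Y, \theta)] \equiv \tau'(\theta)
$$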




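and

$$
E_\theta[\{\tilde\theta - \tau(\theta)\} L'(Y, \theta)] \equiv \tau'(\theta).
$$

By the Cauchy-Schwarz inequality

$$
|\tau'(\theta)|^2 = \big(E_\theta[\{\tilde\theta - \tau(\theta)\} L'(Y, \theta)]\big)^2
\le E_\theta\{\tilde\theta - \tau(\theta)\}^2\, E_\theta |L'(Y, \theta)|^2
= \mathrm{Var}_\theta(\tilde\theta)\, n F(\theta)
$$

and the second assertion follows. The last statement is the usual decomposition of
the quadratic risk into the squared bias and the variance of the estimate.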
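
Both forms of the bound can be checked exactly in a small discrete experiment. The sketch below assumes $n$ i.i.d. Bernoulli observations, so that any estimate depending on the data only through $S = \sum Y_i$ has moments computable from the binomial law, and $F(\theta) = 1/\{\theta(1 - \theta)\}$. The clipped estimate and the finite-difference step `eps` are hypothetical choices made for illustration. The sample mean should attain the bound, while the clipped (biased) estimate should have a variance strictly above $|\tau'(\theta)|^2 / \{n F(\theta)\}$.

```python
import math

def binom_pmf(s, n, theta):
    return math.comb(n, s) * theta ** s * (1 - theta) ** (n - s)

def mean_and_var(estimator, n, theta):
    # exact mean and variance of estimator(S, n) for S ~ Binomial(n, theta)
    probs = [binom_pmf(s, n, theta) for s in range(n + 1)]
    values = [estimator(s, n) for s in range(n + 1)]
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, var

def cramer_rao_bound(estimator, n, theta, eps=1e-5):
    # |tau'(theta)|^2 / (n F(theta)); tau' is approximated by a central difference
    tau_plus, _ = mean_and_var(estimator, n, theta + eps)
    tau_minus, _ = mean_and_var(estimator, n, theta - eps)
    tau_prime = (tau_plus - tau_minus) / (2 * eps)
    return tau_prime ** 2 * theta * (1 - theta) / n

def sample_mean(s, n):
    return s / n

def clipped_mean(s, n):
    # hypothetical biased estimate: the sample mean clipped to [0.2, 0.8]
    return min(max(s / n, 0.2), 0.8)

n, theta = 20, 0.3
for est in (sample_mean, clipped_mean):
    _, var = mean_and_var(est, n, theta)
    print(est.__name__, var, cramer_rao_bound(est, n, theta))
```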


2.6.2 Exponential Families and R-Efficiency 


An interesting question is how good (precise) the Cramér-Rao lower bound is,
and in particular, when it holds with equality. Indeed, if we restrict ourselves to unbiased
estimates, no estimate can have quadratic risk smaller than $[n F(\theta)]^{-1}$. If an estimate
has exactly the risk $[n F(\theta)]^{-1}$, then this estimate is automatically efficient in the
sense that it is the best in the class in terms of the quadratic risk.


Definition 2.6.1. An unbiased estimate $\tilde\theta$ is R-efficient if

$$
\mathrm{Var}_\theta(\tilde\theta) = [n F(\theta)]^{-1}.
$$


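Theorem 2.6.2. An unbiased estimate $\tilde\theta$ is R-efficient if and only if

$$
\tilde\theta = n^{-1} \sum U(Y_i),
$$

where the function $U(\cdot)$ on $\mathbb{R}$ satisfies $\int U(y)\, dP_\theta(y) \equiv \theta$ and the log-density
$\ell(y, \theta)$ of $P_\theta$ can be represented as

$$
\ell(y, \theta) = C(\theta) U(y) - B(\theta) + \ell(y), \qquad (2.12)
$$

for some functions $C(\cdot)$ and $B(\cdot)$ on $\Theta$ and a function $\ell(\cdot)$ on $\mathbb{R}$.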


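Proof. Suppose first that the representation (2.12) for the log-density is correct.
Then $\ell'(y, \theta) = C'(\theta) U(y) - B'(\theta)$ and the identity $E_\theta \ell'(Y, \theta) = 0$ implies the
relation between the functions $B(\cdot)$ and $C(\cdot)$:

$$
\theta\, C'(\theta) = B'(\theta). \qquad (2.13)
$$

Next, differentiating the equality

$$
0 \equiv \int \{U(y) - \theta\}\, dP_\theta(y) = \int \{U(y) - \theta\}\, e^{\ell(y, \theta)}\, d\mu_0(y)
$$

w.r.t. $\theta$ implies in view of (2.13)

$$
1 \equiv E_\theta\big[\{U(Y) - \theta\} \times \{C'(\theta) U(Y) - B'(\theta)\}\big] = C'(\theta)\, E_\theta\{U(Y) - \theta\}^2.
$$

This yields $\mathrm{Var}_\theta\{U(Y)\} = 1 / C'(\theta)$. This leads to the following representation for
the Fisher information:

$$
F(\theta) = \mathrm{Var}_\theta\{\ell'(Y, \theta)\}
= \mathrm{Var}_\theta\{C'(\theta) U(Y) - B'(\theta)\}
= \{C'(\theta)\}^2\, \mathrm{Var}_\theta\{U(Y)\} = C'(\theta).
$$

The estimate $\tilde\theta = n^{-1} \sum U(Y_i)$ satisfies

$$
E_\theta \tilde\theta = \theta,
$$

that is, it is unbiased. Moreover,

$$
\mathrm{Var}_\theta(\tilde\theta) = \mathrm{Var}_\theta\Big\{\frac{1}{n} \sum U(Y_i)\Big\} = \frac{1}{n^2} \sum \mathrm{Var}_\theta\{U(Y_i)\} = \frac{1}{n C'(\theta)} = \frac{1}{n F(\theta)}
$$

and $\tilde\theta$ is R-efficient.
Now we show a reverse statement. Due to the proof of the Cramér-Rao
inequality, the only possibility of getting the equality in this inequality is if

$$
L'(Y, \theta) = n F(\theta)\, (\tilde\theta - \theta).
$$

This implies for some fixed $\theta_0$ and any $\theta^\circ$

$$
L(Y, \theta^\circ) - L(Y, \theta_0) = \int_{\theta_0}^{\theta^\circ} L'(Y, \theta)\, d\theta
= n \int_{\theta_0}^{\theta^\circ} F(\theta) (\tilde\theta - \theta)\, d\theta = n \{\tilde\theta\, C(\theta^\circ) - B(\theta^\circ)\}
$$

with $C(\theta) = \int_{\theta_0}^{\theta} F(\theta)\, d\theta$ and $B(\theta) = \int_{\theta_0}^{\theta} \theta F(\theta)\, d\theta$. Applying this equality to a
sample with $n = 1$ yields $U(Y_1) = \tilde\theta(Y_1)$, and

$$
\ell(Y_1, \theta) = \ell(Y_1, \theta_0) + C(\theta) U(Y_1) - B(\theta).
$$

The desired representation follows.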


Exercise 2.6.1. Apply the Cramér-Rao inequality to the empirical mean estimate
$\tilde\theta = n^{-1} \sum Y_i$ and check its R-efficiency for the Gaussian shift, Bernoulli, Poisson,
exponential, and volatility families.


2.7 Cramér-Rao Inequality: Multivariate Parameter 


This section extends the notions and results of the previous sections from the case
of a univariate parameter to the case of a multivariate parameter with $\theta \in \Theta \subseteq \mathbb{R}^p$.


2.7.1 Regularity and Fisher Information: Multivariate 
Parameter 


The definition of regularity naturally extends to the case of a multivariate parameter
$\theta = (\theta_1, \ldots, \theta_p)^\top$. It suffices to check the same conditions as in the univariate case
for every partial derivative $\partial p(y, \theta) / \partial \theta_j$ of the density $p(y, \theta)$ for $j = 1, \ldots, p$.


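Definition 2.7.1. The family $(P_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$ is regular if the following
conditions are fulfilled:

1. The sets $A(\theta) \stackrel{\text{def}}{=} \{y : p(y, \theta) = 0\}$ are the same for all $\theta \in \Theta$.
2. Differentiability under the integration sign: for any function $s(y)$ satisfying

$$
\int s^2(y)\, p(y, \theta)\, d\mu_0(y) \le C, \qquad \theta \in \Theta,
$$

it holds

$$
\frac{\partial}{\partial \theta} \int s(y)\, dP_\theta(y) = \frac{\partial}{\partial \theta} \int s(y)\, p(y, \theta)\, d\mu_0(y) = \int s(y)\, \frac{\partial p(y, \theta)}{\partial \theta}\, d\mu_0(y).
$$

3. Finite Fisher information: the log-density function $\ell(y, \theta)$ is differentiable in $\theta$
and its derivative $\nabla \ell(y, \theta) = \partial \ell(y, \theta) / \partial \theta$ is square integrable w.r.t. $P_\theta$:

$$
\int \|\nabla \ell(y, \theta)\|^2\, dP_\theta(y) = \int \frac{\|\nabla p(y, \theta)\|^2}{p(y, \theta)}\, d\mu_0(y) < \infty.
$$

In the case of a multivariate parameter, the notion of the Fisher information leads
to the Fisher information matrix.


Definition 2.7.2. Let $(P_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$ be a parametric family. The matrix

$$
F(\theta) \stackrel{\text{def}}{=} \int \nabla \ell(y, \theta)\, \nabla^\top \ell(y, \theta)\, p(y, \theta)\, d\mu_0(y)
= \int \nabla p(y, \theta)\, \nabla^\top p(y, \theta)\, \frac{1}{p(y, \theta)}\, d\mu_0(y)
$$

is called the Fisher information matrix of $(P_\theta)$ at $\theta \in \Theta$.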


This definition can be rewritten as

$$
F(\theta) = E_\theta\big[\nabla \ell(Y_1, \theta) \{\nabla \ell(Y_1, \theta)\}^\top\big].
$$


The additivity property of the Fisher information extends to the multivariate case as 
well. 


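Lemma 2.7.1. Let $(P_\theta, \theta \in \Theta)$ be a regular family. Then the $n$-fold product family
$(\mathbb{P}_\theta)$ with $\mathbb{P}_\theta = P_\theta^{\otimes n}$ is also regular. The Fisher information matrix $F(\theta)$ satisfies

$$
E_\theta\big[\nabla L(Y, \theta) \{\nabla L(Y, \theta)\}^\top\big] = n F(\theta). \qquad (2.14)
$$


Exercise 2.7.1. Compute the Fisher information matrix for the i.i.d. experiment
$Y_i = \theta + \sigma \varepsilon_i$ with unknown $\theta$ and $\sigma$ and $\varepsilon_i$ i.i.d. standard normal.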


2.7.2 Local Properties of the Kullback—Leibler Divergence 
and Hellinger Distance 


The local relations between the Kullback—Leibler divergence, rate function, and 
Fisher information naturally extend to the case of a multivariate parameter. We start 
with the Kullback—Leibler divergence. 


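Lemma 2.7.2. Let $(P_\theta)$ be a regular family. Then the KL-divergence $K(\theta, \theta')$
satisfies:

$$
K(\theta, \theta')\Big|_{\theta' = \theta} = 0, \qquad
\frac{d}{d\theta'} K(\theta, \theta')\Big|_{\theta' = \theta} = 0, \qquad
\frac{d^2}{d\theta'^2} K(\theta, \theta')\Big|_{\theta' = \theta} = F(\theta).
$$

In a small neighborhood of $\theta$, the KL-divergence can be approximated by

$$
K(\theta, \theta') \approx (\theta' - \theta)^\top F(\theta)\, (\theta' - \theta) / 2.
$$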




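Similar properties can be established for the rate function $m(\mu, \theta, \theta')$.


Lemma 2.7.3. Let $(P_\theta)$ be a regular family. Then the rate function $m(\mu, \theta, \theta')$
satisfies:

$$
m(\mu, \theta, \theta')\Big|_{\theta' = \theta} = 0, \qquad
\frac{d}{d\theta'} m(\mu, \theta, \theta')\Big|_{\theta' = \theta} = 0, \qquad
\frac{d^2}{d\theta'^2} m(\mu, \theta, \theta')\Big|_{\theta' = \theta} = \mu(1 - \mu) F(\theta).
$$

In a small neighborhood of $\theta$, the rate function can be approximated by

$$
m(\mu, \theta, \theta') \approx \mu(1 - \mu)\, (\theta' - \theta)^\top F(\theta)\, (\theta' - \theta) / 2.
$$

Moreover, for any $\theta, \theta' \in \Theta$

$$
m(\mu, \theta, \theta')\Big|_{\mu = 0} = 0, \qquad
\frac{d}{d\mu} m(\mu, \theta, \theta')\Big|_{\mu = 0} = E_\theta \ell(Y, \theta, \theta') = K(\theta, \theta'), \qquad
\frac{d^2}{d\mu^2} m(\mu, \theta, \theta')\Big|_{\mu = 0} = -\mathrm{Var}_\theta[\ell(Y, \theta, \theta')].
$$

This implies an approximation for $\mu$ small:

$$
m(\mu, \theta, \theta') \approx \mu K(\theta, \theta') - \frac{\mu^2}{2} \mathrm{Var}_\theta[\ell(Y, \theta, \theta')].
$$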


Exercise 2.7.2. Check the statements of Lemmas 2.7.2 and 2.7.3. 


2.7.3 Multivariate Cramér—Rao Inequality 


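Let $\tilde\theta = \tilde\theta(Y)$ be an estimate of the unknown parameter vector. This estimate is
called unbiased if

$$
E_\theta \tilde\theta \equiv \theta.
$$


Theorem 2.7.1 (Multivariate Cramér-Rao Inequality). Let $\tilde\theta = \tilde\theta(Y)$ be an
unbiased estimate of $\theta$ for an i.i.d. sample from a regular family $(P_\theta)$. Then

$$
\mathrm{Var}_\theta(\tilde\theta) \ge \{n F(\theta)\}^{-1}, \qquad
E_\theta \|\tilde\theta - \theta\|^2 = \mathrm{tr}\{\mathrm{Var}_\theta(\tilde\theta)\} \ge \mathrm{tr}\big[\{n F(\theta)\}^{-1}\big].
$$

Moreover, if $\tilde\theta$ is not unbiased and $\tau(\theta) = E_\theta \tilde\theta$, then with $\nabla \tau(\theta) \stackrel{\text{def}}{=} \frac{d}{d\theta} \tau(\theta)$, it
holds

$$
\mathrm{Var}_\theta(\tilde\theta) \ge \nabla \tau(\theta)\, \{n F(\theta)\}^{-1} \{\nabla \tau(\theta)\}^\top
$$

and

$$
E_\theta \|\tilde\theta - \theta\|^2 = \mathrm{tr}[\mathrm{Var}_\theta(\tilde\theta)] + \|\tau(\theta) - \theta\|^2
\ge \mathrm{tr}\big[\nabla \tau(\theta)\, \{n F(\theta)\}^{-1} \{\nabla \tau(\theta)\}^\top\big] + \|\tau(\theta) - \theta\|^2.
$$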


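Proof. Consider first the case of an unbiased estimate $\tilde\theta$ with $E_\theta \tilde\theta \equiv \theta$. Differentiating
the identity (2.9) $E_0 \exp L(Y, \theta) \equiv 1$ w.r.t. $\theta$ yields

$$
0 \equiv \int \nabla L(y, \theta) \exp\{L(y, \theta)\}\, \boldsymbol{\mu}_0(dy) = E_\theta[\nabla L(Y, \theta)]. \qquad (2.15)
$$

Similarly, the identity $E_\theta \tilde\theta = \theta$ implies

$$
I \equiv \int \tilde\theta(y) \{\nabla L(y, \theta)\}^\top \exp\{L(y, \theta)\}\, \boldsymbol{\mu}_0(dy) = E_\theta\big[\tilde\theta\, \{\nabla L(Y, \theta)\}^\top\big].
$$

Together with (2.15), this gives

$$
E_\theta\big[(\tilde\theta - \theta) \{\nabla L(Y, \theta)\}^\top\big] \equiv I. \qquad (2.16)
$$

Consider the random vector

$$
h \stackrel{\text{def}}{=} \{n F(\theta)\}^{-1} \nabla L(Y, \theta).
$$

By (2.15) $E_\theta h = 0$ and by (2.14)

$$
\mathrm{Var}_\theta(h) = E_\theta(h h^\top) = n^{-2} E_\theta\big[F^{-1}(\theta) \nabla L(Y, \theta) \{F^{-1}(\theta) \nabla L(Y, \theta)\}^\top\big]
= n^{-2} F^{-1}(\theta)\, E_\theta\big[\nabla L(Y, \theta) \{\nabla L(Y, \theta)\}^\top\big] F^{-1}(\theta) = \{n F(\theta)\}^{-1}
$$

and the identities (2.15) and (2.16) imply that

$$
E_\theta\big[(\tilde\theta - \theta - h) h^\top\big] = 0. \qquad (2.17)
$$

The "no bias" property yields $E_\theta(\tilde\theta - \theta) = 0$ and $E_\theta\big[(\tilde\theta - \theta)(\tilde\theta - \theta)^\top\big] = \mathrm{Var}_\theta(\tilde\theta)$.
Finally, by the orthogonality (2.17),

$$
\mathrm{Var}_\theta(\tilde\theta) = \mathrm{Var}_\theta(h) + \mathrm{Var}_\theta(\tilde\theta - \theta - h)
= \{n F(\theta)\}^{-1} + \mathrm{Var}_\theta(\tilde\theta - \theta - h)
$$

and the variance of $\tilde\theta$ is not smaller than $\{n F(\theta)\}^{-1}$. Moreover, the equality is only
possible if $\tilde\theta - \theta - h$ is equal to zero almost surely.

Now we consider the general case. The proof is similar. The property (2.15)
continues to hold. Next, the identity $E_\theta \tilde\theta = \theta$ is replaced with $E_\theta \tilde\theta = \tau(\theta)$
yielding

$$
E_\theta\big[\tilde\theta\, \{\nabla L(Y, \theta)\}^\top\big] \equiv \nabla \tau(\theta)
$$

and

$$
E_\theta\big[\{\tilde\theta - \tau(\theta)\} \{\nabla L(Y, \theta)\}^\top\big] \equiv \nabla \tau(\theta).
$$

Define

$$
h \stackrel{\text{def}}{=} \nabla \tau(\theta)\, \{n F(\theta)\}^{-1} \nabla L(Y, \theta).
$$

Then similarly to the above

$$
E_\theta[h h^\top] = \nabla \tau(\theta)\, \{n F(\theta)\}^{-1} \{\nabla \tau(\theta)\}^\top, \qquad
E_\theta\big[(\tilde\theta - \theta - h) h^\top\big] = 0,
$$

and the second assertion follows. The statements about the quadratic risk follow
from its usual decomposition into squared bias and the variance of the estimate.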


2.7.4 Exponential Families and R-Efficiency 


The notion of R-efficiency naturally extends to the case of a multivariate parameter. 


Definition 2.7.3. An unbiased estimate $\tilde\theta$ is R-efficient if

$$
\mathrm{Var}_\theta(\tilde\theta) = \{n F(\theta)\}^{-1}.
$$

Theorem 2.7.2. An unbiased estimate $\tilde\theta$ is R-efficient if and only if

$$
\tilde\theta = n^{-1} \sum U(Y_i),
$$

where the vector function $U(\cdot)$ on $\mathbb{R}$ satisfies $\int U(y)\, dP_\theta(y) \equiv \theta$ and the log-density
$\ell(y, \theta)$ of $P_\theta$ can be represented as

$$
\ell(y, \theta) = C(\theta)^\top U(y) - B(\theta) + \ell(y), \qquad (2.18)
$$

for some functions $C(\cdot)$ and $B(\cdot)$ on $\Theta$ and a function $\ell(\cdot)$ on $\mathbb{R}$.


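Proof. Suppose first that the representation (2.18) for the log-density is correct.
Denote by $C'(\theta)$ the $p \times p$ Jacobi matrix of the vector function $C$: $C'(\theta) \stackrel{\text{def}}{=} \frac{d}{d\theta} C(\theta)$.
Then $\nabla \ell(y, \theta) = C'(\theta) U(y) - \nabla B(\theta)$ and the identity $E_\theta \nabla \ell(Y, \theta) = 0$
implies the relation between the functions $B(\cdot)$ and $C(\cdot)$:

$$
C'(\theta)\, \theta = \nabla B(\theta). \qquad (2.19)
$$

Next, differentiating the equality

$$
0 \equiv \int [U(y) - \theta]\, dP_\theta(y) = \int [U(y) - \theta]\, e^{\ell(y, \theta)}\, d\mu_0(y)
$$

w.r.t. $\theta$ implies in view of (2.19)

$$
I \equiv E_\theta\big[\{U(Y) - \theta\} \{C'(\theta) U(Y) - \nabla B(\theta)\}^\top\big]
= C'(\theta)\, E_\theta\big[\{U(Y) - \theta\} \{U(Y) - \theta\}^\top\big].
$$

This yields $\mathrm{Var}_\theta[U(Y)] = [C'(\theta)]^{-1}$. This leads to the following representation
for the Fisher information matrix:

$$
F(\theta) = \mathrm{Var}_\theta[\nabla \ell(Y, \theta)] = \mathrm{Var}_\theta[C'(\theta) U(Y) - \nabla B(\theta)]
= [C'(\theta)]^2\, \mathrm{Var}_\theta[U(Y)] = C'(\theta).
$$

The estimate $\tilde\theta = n^{-1} \sum U(Y_i)$ satisfies

$$
E_\theta \tilde\theta = \theta,
$$

that is, it is unbiased. Moreover,

$$
\mathrm{Var}_\theta(\tilde\theta) = \mathrm{Var}_\theta\Big(\frac{1}{n} \sum U(Y_i)\Big)
= \frac{1}{n^2} \sum \mathrm{Var}_\theta[U(Y_i)] = \frac{1}{n} [C'(\theta)]^{-1} = \{n F(\theta)\}^{-1}
$$

and $\tilde\theta$ is R-efficient.
As in the univariate case, one can show that equality in the Cramér-Rao bound
is only possible if $\nabla L(Y, \theta)$ and $\tilde\theta - \theta$ are linearly dependent. This leads again to
the exponential family structure of the likelihood function.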


Exercise 2.7.3. Complete the proof of Theorem 2.7.2.


2.8 Maximum Likelihood and Other Estimation Methods 


This section presents some other popular methods of estimating the unknown 
parameter including minimum distance and M-estimation, maximum likelihood 
procedure, etc. 


2.8.1 Minimum Distance Estimation 


Let $\rho(P, P')$ denote some functional (distance) defined for measures $P, P'$ on the
real line. We assume that $\rho$ satisfies the following conditions: $\rho(P_{\theta_1}, P_{\theta_2}) \ge 0$ and
$\rho(P_{\theta_1}, P_{\theta_2}) = 0$ iff $\theta_1 = \theta_2$. This implies for every $\theta^* \in \Theta$ that

$$
\operatorname*{argmin}_{\theta \in \Theta} \rho(P_\theta, P_{\theta^*}) = \theta^*.
$$

The Glivenko-Cantelli theorem states that the empirical measure $P_n$ converges weakly to the true
distribution $P_{\theta^*}$. Therefore, it is natural to define an estimate $\tilde\theta$ of $\theta^*$ by replacing in this
formula the true measure $P_{\theta^*}$ by its empirical counterpart $P_n$, that is, by minimizing
the distance $\rho$ between the measures $P_\theta$ and $P_n$ over the set $(P_\theta)$. This leads to the
minimum distance estimate

$$
\tilde\theta = \operatorname*{argmin}_{\theta \in \Theta} \rho(P_\theta, P_n).
$$
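
The construction can be illustrated with one concrete choice of $\rho$. The sketch below is only an example under stated assumptions: it takes the Kolmogorov (sup-norm) distance between distribution functions as $\rho$, the Gaussian shift family $N(\theta, 1)$ as $(P_\theta)$, a small made-up sample, and a brute-force search over a parameter grid. None of these choices is prescribed by the text.

```python
import math

def normal_cdf(y, theta):
    # distribution function of N(theta, 1)
    return 0.5 * (1.0 + math.erf((y - theta) / math.sqrt(2.0)))

def kolmogorov_distance(theta, sample):
    # sup_y |F_theta(y) - F_n(y)|; the supremum is attained at a jump of the edf
    ys = sorted(sample)
    n = len(ys)
    dist = 0.0
    for i, y in enumerate(ys):
        f = normal_cdf(y, theta)
        # the edf jumps from i/n to (i + 1)/n at the point y
        dist = max(dist, abs(f - i / n), abs(f - (i + 1) / n))
    return dist

def minimum_distance_estimate(sample, grid):
    # brute-force minimization of rho(P_theta, P_n) over a finite grid
    return min(grid, key=lambda theta: kolmogorov_distance(theta, sample))

sample = [1.8, 2.4, 1.1, 2.9, 2.2, 1.6, 2.7, 2.0, 3.3, 1.4]   # illustrative data
grid = [k / 100 for k in range(0, 401)]
theta_md = minimum_distance_estimate(sample, grid)
print(theta_md, kolmogorov_distance(theta_md, sample))
```

A grid search is used only to keep the example short; any one-dimensional optimizer could replace it.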


2.8.2 M-Estimation and Maximum Likelihood Estimation


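Another general method of building an estimate of $\theta^*$, the so-called M-estimation,
is defined via a contrast function $\psi(y, \theta)$ given for every $y \in \mathbb{R}$ and $\theta \in \Theta$. The
principal condition on $\psi$ is that the integral $E_\theta \psi(Y, \theta')$ is minimized for $\theta' = \theta$:

$$
\theta = \operatorname*{argmin}_{\theta'} \int \psi(y, \theta')\, dP_\theta(y), \qquad \theta \in \Theta. \qquad (2.20)
$$

In particular,

$$
\theta^* = \operatorname*{argmin}_{\theta \in \Theta} \int \psi(y, \theta)\, dP_{\theta^*}(y),
$$

and the M-estimate is again obtained by substitution, that is, by replacing the true
measure $P_{\theta^*}$ with its empirical counterpart $P_n$:

$$
\tilde\theta = \operatorname*{argmin}_{\theta \in \Theta} \int \psi(y, \theta)\, dP_n(y) = \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{n} \sum \psi(Y_i, \theta).
$$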


Exercise 2.8.1. Let $Y$ be an i.i.d. sample from $P \in (P_\theta, \theta \in \Theta \subseteq \mathbb{R})$.
(i) Let also $g(y)$ satisfy $\int g(y)\, dP_\theta(y) \equiv \theta$, leading to the moment estimate
$\tilde\theta = n^{-1} \sum g(Y_i)$.
Show that this estimate can be obtained as the M-estimate for a properly
selected function $\psi(\cdot)$.
(ii) Let $\int g(y)\, dP_\theta(y) \equiv m(\theta)$ for the given functions $g(\cdot)$ and $m(\cdot)$, where
$m(\cdot)$ is monotone. Show that the moment estimate $\tilde\theta = m^{-1}(M_n)$ with
$M_n = n^{-1} \sum g(Y_i)$ can be obtained as the M-estimate for a properly selected
function $\psi(\cdot)$.


We mention three prominent examples of the contrast function $\psi$ and the
resulting estimates: least squares, least absolute deviation (LAD), and maximum
likelihood.


2.8.2.1 Least Squares Estimation 


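The least squares estimate (LSE) corresponds to the quadratic contrast

$$
\psi(y, \theta) = \|g(y) - \theta\|^2,
$$

where $g(y)$ is a $p$-dimensional function of the observation $y$ satisfying

$$
\int g(y)\, dP_\theta(y) = \theta, \qquad \theta \in \Theta.
$$

Then the true parameter $\theta^*$ fulfills the relation

$$
\theta^* = \operatorname*{argmin}_{\theta \in \Theta} \int \|g(y) - \theta\|^2\, dP_{\theta^*}(y)
$$

because

$$
\int \|g(y) - \theta\|^2\, dP_{\theta^*}(y) = \|\theta^* - \theta\|^2 + \int \|g(y) - \theta^*\|^2\, dP_{\theta^*}(y).
$$

The substitution method leads to the estimate $\tilde\theta$ of $\theta^*$ defined by minimization of
the empirical version of the integral $\int \|g(y) - \theta\|^2\, dP_{\theta^*}(y)$:

$$
\tilde\theta \stackrel{\text{def}}{=} \operatorname*{argmin}_{\theta \in \Theta} \int \|g(y) - \theta\|^2\, dP_n(y) = \operatorname*{argmin}_{\theta \in \Theta} \sum \|g(Y_i) - \theta\|^2.
$$

This is again a quadratic optimization problem having a closed form solution called
least squares or ordinary LSE.


Lemma 2.8.1. It holds

$$
\tilde\theta = \operatorname*{argmin}_{\theta \in \Theta} \sum \|g(Y_i) - \theta\|^2 = \frac{1}{n} \sum g(Y_i).
$$

One can see that the LSE $\tilde\theta$ coincides with the moment estimate based on the
function $g(\cdot)$. Indeed, the equality $\int g(y)\, dP_{\theta^*}(y) = \theta^*$ leads directly to the LSE

$$
\tilde\theta = n^{-1} \sum g(Y_i).
$$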

2.8.2.2 LAD (Median) Estimation 


The next example of an M-estimate is given by the absolute deviation contrast
fit. For simplicity of presentation, we consider here only the case of a univariate
parameter. The contrast function $\psi(y, \theta)$ is given by $\psi(y, \theta) \stackrel{\text{def}}{=} |y - \theta|$. The
solution of the related optimization problem (2.20) is given by the median $\mathrm{med}(P_\theta)$
of the distribution $P_\theta$.


Definition 2.8.1. The value $t$ is called the median of a distribution function $F$ if

$$
F(t) \ge 1/2, \qquad F(t-) \le 1/2.
$$

If $F(\cdot)$ is a continuous function, then the median $t = \mathrm{med}(F)$ satisfies $F(t) = 1/2$.


Theorem 2.8.1. For any cdf $F$, the median $\mathrm{med}(F)$ satisfies

$$
\inf_{\theta \in \mathbb{R}} \int |y - \theta|\, dF(y) = \int |y - \mathrm{med}(F)|\, dF(y).
$$

Proof. Consider for simplicity the case of a continuous distribution function $F$. One
has $|y - \theta| = (\theta - y) 1(y < \theta) + (y - \theta) 1(y \ge \theta)$. Differentiating w.r.t. $\theta$ yields
the following equation for any extreme point of $\int |y - \theta|\, dF(y)$:

$$
-\int_{-\infty}^{\theta} dF(y) + \int_{\theta}^{\infty} dF(y) = 0.
$$

This equation means $F(\theta) = 1 - F(\theta)$, that is, $F(\theta) = 1/2$, so it is solved by the median.


Let the family $(P_\theta)$ be such that $\theta = \mathrm{med}(P_\theta)$ for all $\theta \in \mathbb{R}$. Then the
M-estimation approach leads to the LAD estimate

$$
\tilde\theta \stackrel{\text{def}}{=} \operatorname*{argmin}_{\theta \in \mathbb{R}} \int |y - \theta|\, dF_n(y) = \operatorname*{argmin}_{\theta \in \mathbb{R}} \sum |Y_i - \theta|.
$$

Due to Theorem 2.8.1, the solution of this problem is given by the median of the
edf $F_n$.
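
The difference between the quadratic and the absolute deviation contrast shows up clearly on data with an outlier. The following sketch, with a made-up sample and a brute-force grid search standing in for the minimization, computes both M-estimates for $g(y) = y$ and compares them with the sample mean and the sample median; the helper names are illustrative.

```python
import statistics

def contrast_sum(psi, sample, theta):
    # empirical contrast: sum_i psi(Y_i, theta)
    return sum(psi(y, theta) for y in sample)

def m_estimate(psi, sample, grid):
    # brute-force minimization of the empirical contrast over a grid
    return min(grid, key=lambda theta: contrast_sum(psi, sample, theta))

def squared(y, theta):
    return (y - theta) ** 2

def absolute(y, theta):
    return abs(y - theta)

sample = [0.9, 1.4, 2.0, 2.3, 9.5]          # illustrative data with one gross outlier
grid = [k / 100 for k in range(0, 1001)]     # candidate values 0.00, 0.01, ..., 10.00

lse = m_estimate(squared, sample, grid)
lad = m_estimate(absolute, sample, grid)
print(lse, statistics.mean(sample))
print(lad, statistics.median(sample))
```

The LSE follows the outlier, while the LAD estimate stays at the middle observation.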


2.8.2.3 Maximum Likelihood Estimation 


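Let now $\psi(y, \theta) = -\ell(y, \theta) = -\log p(y, \theta)$ where $p(y, \theta)$ is the density of the
measure $P_\theta$ at $y$ w.r.t. some dominating measure $\mu_0$. This choice leads to the MLE:

$$
\tilde\theta = \operatorname*{argmax}_{\theta \in \Theta} \frac{1}{n} \sum \log p(Y_i, \theta).
$$

The condition (2.20) is fulfilled because

$$
\operatorname*{argmin}_{\theta'} \int \psi(y, \theta')\, dP_\theta(y)
= \operatorname*{argmin}_{\theta'} \int \{\psi(y, \theta') - \psi(y, \theta)\}\, dP_\theta(y)
= \operatorname*{argmin}_{\theta'} \int \log \frac{p(y, \theta)}{p(y, \theta')}\, dP_\theta(y)
= \operatorname*{argmin}_{\theta'} K(\theta, \theta') = \theta.
$$

Here we used that the Kullback-Leibler divergence $K(\theta, \theta')$ attains its minimum
equal to zero at the point $\theta' = \theta$ which in turn follows from the concavity of the
log-function by the Jensen inequality.

Note that the definition of the MLE does not depend on the choice of the
dominating measure $\mu_0$.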


Exercise 2.8.2. Show that the MLE $\tilde\theta$ does not change if another dominating
measure is used.


Computing an M-estimate or MLE leads to solving an optimization problem
for the empirical quantity $\sum \psi(Y_i, \theta)$ w.r.t. the parameter $\theta$. If the function $\psi$ is
differentiable w.r.t. $\theta$, then the solution can be found from the estimating equation

$$
\frac{\partial}{\partial \theta} \sum \psi(Y_i, \theta) = 0.
$$

Exercise 2.8.3. Show that any M-estimate and particularly the MLE can be
represented as minimum distance estimate with a properly defined distance $\rho$.


Hint: define $\rho(P_\theta, P_{\theta^*})$ as $\int [\psi(y, \theta) - \psi(y, \theta^*)]\, dP_{\theta^*}(y)$.
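
When the estimating equation has no convenient closed form, it can be solved numerically. As a minimal sketch, the example below takes the contrast $\psi(y, \theta) = -\log p(y, \theta)$ for the exponential law with density $e^{-y/\theta}/\theta$, for which the root is known to be the sample mean, and finds the root by bisection. The sample, the bracketing interval and the helper names are assumptions of the example.

```python
def estimating_function(theta, sample):
    # derivative in theta of sum_i psi(Y_i, theta), psi(y, theta) = log(theta) + y / theta
    n, s = len(sample), sum(sample)
    return n / theta - s / theta ** 2

def solve_by_bisection(func, lo, hi, tol=1e-10):
    # assumes that func(lo) and func(hi) have opposite signs
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid      # the root lies in [mid, hi]
        else:
            hi = mid                   # the root lies in [lo, mid]
    return 0.5 * (lo + hi)

sample = [0.7, 2.1, 0.3, 1.6, 1.1, 3.4, 0.5]   # illustrative data
theta_hat = solve_by_bisection(lambda t: estimating_function(t, sample), 0.01, 100.0)
print(theta_hat, sum(sample) / len(sample))
```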


Recall that the MLE $\tilde\theta$ is defined by maximizing the expression $L(\theta) =
\sum \ell(Y_i, \theta)$ w.r.t. $\theta$. Below we use the notation $L(\theta, \theta') \stackrel{\text{def}}{=} L(\theta) - L(\theta')$, often
called the log-likelihood ratio.

In our study we will focus on the value of the maximum $L(\tilde\theta) = \max_\theta L(\theta)$.


Let $L(\theta) = \sum \ell(Y_i, \theta)$ be the log-likelihood function. The value

$$
L(\tilde\theta) \stackrel{\text{def}}{=} \max_\theta L(\theta)
$$

is called the maximum log-likelihood or fitted log-likelihood. The excess $L(\tilde\theta) -
L(\theta^*)$ is the difference between the maximum of the log-likelihood function $L(\theta)$ over
$\theta$ and its particular value at the true parameter $\theta^*$:

$$
L(\tilde\theta, \theta^*) \stackrel{\text{def}}{=} \max_\theta L(\theta) - L(\theta^*).
$$

The next section collects some examples of computing the MLE $\tilde\theta$ and the
corresponding maximum log-likelihood.


2.9 Maximum Likelihood for Some Parametric Families 


The examples of this section focus on the structure of the log-likelihood and the
corresponding MLE $\tilde\theta$ and the maximum log-likelihood $L(\tilde\theta)$.


2.9.1 Gaussian Shift 


Let $P_\theta$ be the normal distribution on the real line with mean $\theta$ and the known
variance $\sigma^2$. The corresponding density w.r.t. the Lebesgue measure reads as

$$
p(y, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{(y - \theta)^2}{2\sigma^2}\Big\}.
$$

The log-likelihood $L(\theta)$ is

$$
L(\theta) = \sum \log p(Y_i, \theta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum (Y_i - \theta)^2.
$$

The corresponding normal equation $L'(\theta) = 0$ yields

$$
\frac{1}{\sigma^2} \sum (Y_i - \theta) = 0 \qquad (2.21)
$$

leading to the empirical mean solution $\tilde\theta = n^{-1} \sum Y_i$.
The computation of the fitted likelihood is a bit more involved.


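Theorem 2.9.1. Let $Y_i = \theta^* + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$. For any $\theta$

$$
L(\tilde\theta, \theta) = n \sigma^{-2} (\tilde\theta - \theta)^2 / 2. \qquad (2.22)
$$

Moreover,

$$
L(\tilde\theta, \theta^*) = n \sigma^{-2} (\tilde\theta - \theta^*)^2 / 2 = \xi^2 / 2
$$

where $\xi$ is a standard normal r.v. so that $2 L(\tilde\theta, \theta^*)$ has the fixed $\chi_1^2$ distribution with
one degree of freedom. If $\mathfrak{z}_\alpha$ is the quantile of $\chi_1^2 / 2$ with $P(\xi^2 / 2 > \mathfrak{z}_\alpha) = \alpha$, then

$$
\mathcal{E}(\mathfrak{z}_\alpha) = \{u : L(\tilde\theta, u) \le \mathfrak{z}_\alpha\} \qquad (2.23)
$$

is an $\alpha$-confidence set: $P_{\theta^*}(\mathcal{E}(\mathfrak{z}_\alpha) \not\ni \theta^*) = \alpha$.
For every $r > 0$,

$$
E_{\theta^*} |2 L(\tilde\theta, \theta^*)|^r = c_r,
$$

where $c_r = E|\xi|^{2r}$ with $\xi \sim N(0, 1)$.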


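Proof (Proof 1). Consider $L(\tilde\theta, \theta) = L(\tilde\theta) - L(\theta)$ as a function of the parameter $\theta$.
Obviously

$$
L(\tilde\theta, \theta) = -\frac{1}{2\sigma^2} \sum \big[(Y_i - \tilde\theta)^2 - (Y_i - \theta)^2\big],
$$

so that $L(\tilde\theta, \theta)$ is a quadratic function of $\theta$. Next, it holds $L(\tilde\theta, \theta)\big|_{\theta = \tilde\theta} = 0$ and
$\frac{d}{d\theta} L(\tilde\theta, \theta)\big|_{\theta = \tilde\theta} = -\frac{d}{d\theta} L(\theta)\big|_{\theta = \tilde\theta} = 0$ due to the normal equation (2.21). Finally,

$$
\frac{d^2}{d\theta^2} L(\tilde\theta, \theta)\Big|_{\theta = \tilde\theta} = -\frac{d^2}{d\theta^2} L(\theta)\Big|_{\theta = \tilde\theta} = \frac{n}{\sigma^2}.
$$

This implies by the Taylor expansion of a quadratic function $L(\tilde\theta, \theta)$ at $\theta = \tilde\theta$:

$$
L(\tilde\theta, \theta) = \frac{n}{2\sigma^2} (\tilde\theta - \theta)^2.
$$

Proof 2. First observe that for any two points $\theta', \theta$, the log-likelihood ratio
$L(\theta', \theta) = \log(d\mathbb{P}_{\theta'} / d\mathbb{P}_\theta) = L(\theta') - L(\theta)$ can be represented in the form

$$
L(\theta', \theta) = L(\theta') - L(\theta) = \sigma^{-2} (S - n\theta)(\theta' - \theta) - n \sigma^{-2} (\theta' - \theta)^2 / 2
$$

with $S = Y_1 + \ldots + Y_n$. Substituting the MLE $\tilde\theta = S/n$ in place of $\theta'$ implies

$$
L(\tilde\theta, \theta) = n \sigma^{-2} (\tilde\theta - \theta)^2 / 2.
$$

Now we consider the second statement about the distribution of $L(\tilde\theta, \theta^*)$. The
substitution $\theta = \theta^*$ in (2.22) and the model equation $Y_i = \theta^* + \varepsilon_i$ imply $\tilde\theta - \theta^* =
n^{-1/2} \sigma \xi$, where

$$
\xi \stackrel{\text{def}}{=} \frac{1}{\sigma \sqrt{n}} \sum \varepsilon_i
$$

is standard normal. Therefore,

$$
L(\tilde\theta, \theta^*) = \xi^2 / 2.
$$

This easily implies the result of the theorem.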


We see that under $P_{\theta^*}$ the variable $2 L(\tilde\theta, \theta^*)$ is $\chi_1^2$ distributed with one degree
of freedom, and this distribution does not depend on the sample size $n$ and the scale
parameter $\sigma$. This fact is known in a more general form as chi-squared theorem.
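
The identity (2.22) and the pivotal $\chi_1^2$ behavior of $2 L(\tilde\theta, \theta^*)$ can be illustrated by a small simulation. The sketch below is a Monte Carlo check under assumed values $\theta^* = 1$, $\sigma = 2$, $n = 25$ and $\alpha = 0.05$; the constant `1.959963984540054` is the $0.975$-quantile of the standard normal law, so that $\mathfrak{z}_\alpha$ is its square divided by two. The simulated coverage should be close to $1 - \alpha$ and the average of $2L$ close to one, up to simulation error.

```python
import math
import random

def log_likelihood(theta, sample, sigma):
    # Gaussian shift log-likelihood with known sigma
    n = len(sample)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((y - theta) ** 2 for y in sample) / (2 * sigma ** 2))

def fitted_excess(sample, theta, sigma):
    # L(theta_tilde, theta) = L(theta_tilde) - L(theta) with theta_tilde the sample mean
    theta_tilde = sum(sample) / len(sample)
    return log_likelihood(theta_tilde, sample, sigma) - log_likelihood(theta, sample, sigma)

random.seed(0)
theta_star, sigma, n = 1.0, 2.0, 25
z_alpha = 1.959963984540054 ** 2 / 2    # P(xi^2 / 2 > z_alpha) = 0.05

# check of the identity (2.22) at an arbitrary point theta = 0.3
sample = [random.gauss(theta_star, sigma) for _ in range(n)]
theta_tilde = sum(sample) / n
lhs = fitted_excess(sample, 0.3, sigma)
rhs = n * (theta_tilde - 0.3) ** 2 / (2 * sigma ** 2)

# Monte Carlo check of the coverage of the confidence set (2.23)
n_rep = 4000
excesses = []
for _ in range(n_rep):
    ys = [random.gauss(theta_star, sigma) for _ in range(n)]
    excesses.append(fitted_excess(ys, theta_star, sigma))
coverage = sum(e <= z_alpha for e in excesses) / n_rep
mean_2L = 2 * sum(excesses) / n_rep
print(lhs, rhs, coverage, mean_2L)
```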


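Exercise 2.9.1. Check that the confidence sets

$$
\mathcal{E}^\circ(z_\alpha) \stackrel{\text{def}}{=} [\tilde\theta - n^{-1/2} \sigma z_\alpha,\ \tilde\theta + n^{-1/2} \sigma z_\alpha],
$$

where $z_\alpha$ is defined by $2\Phi(-z_\alpha) = \alpha$, and $\mathcal{E}(\mathfrak{z}_\alpha)$ from (2.23) coincide.

Exercise 2.9.2. Compute the constant $c_r$ from Theorem 2.9.1 for $r = 0.5, 1, 1.5, 2$.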


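Already now we point out an interesting feature of the fitted log-likelihood
$L(\tilde\theta, \theta^*)$. It can be viewed as the normalized squared loss of the estimate $\tilde\theta$ because
$2 L(\tilde\theta, \theta^*) = n \sigma^{-2} |\tilde\theta - \theta^*|^2$. The last statement of Theorem 2.9.1 yields that

$$
E_{\theta^*} |\tilde\theta - \theta^*|^{2r} = c_r\, \sigma^{2r} n^{-r}.
$$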


2.9.2 Variance Estimation for the Normal Law 


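Let $Y_i$ be i.i.d. normal with mean zero and unknown variance $\theta^*$:

$$
Y_i \sim N(0, \theta^*), \qquad \theta^* \in \mathbb{R}_+.
$$

The log-likelihood function reads as

$$
L(\theta) = \sum \log p(Y_i, \theta) = -\frac{n}{2} \log(2\pi\theta) - \frac{1}{2\theta} \sum Y_i^2.
$$

The normal equation $L'(\theta) = 0$ yields

$$
L'(\theta) = -\frac{n}{2\theta} + \frac{1}{2\theta^2} \sum Y_i^2 = 0
$$

leading to

$$
\tilde\theta = \frac{1}{n} S_n
$$

with $S_n = \sum Y_i^2$. Moreover, for any $\theta$

$$
L(\tilde\theta, \theta) = -\frac{n}{2} \log(\tilde\theta / \theta) - \frac{S_n}{2} (1/\tilde\theta - 1/\theta) = n K(\tilde\theta, \theta)
$$

where

$$
K(\theta, \theta') = -\frac{1}{2} \big[\log(\theta / \theta') + 1 - \theta / \theta'\big]
$$

is the Kullback-Leibler divergence for two Gaussian measures $N(0, \theta)$ and $N(0, \theta')$.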


2.9.3 Univariate Normal Distribution


Let $Y_i$ be as in the previous example $N(\alpha, \sigma^2)$ but neither the mean $\alpha$ nor the variance
$\sigma^2$ are known. This leads to estimating the vector $\theta = (\theta_1, \theta_2) = (\alpha, \sigma^2)$ from the
i.i.d. sample $Y$.

The maximum likelihood approach leads to maximizing the log-likelihood w.r.t.
the vector $\theta = (\alpha, \sigma^2)^\top$:

$$
L(\theta) = \sum \log p(Y_i, \theta) = -\frac{n}{2} \log(2\pi\theta_2) - \frac{1}{2\theta_2} \sum (Y_i - \theta_1)^2.
$$


Exercise 2.9.3. Check that the ML approach leads to the same estimates (2.2) as 
the method of moments. 


2.9.4 Uniform Distribution on $[0, \theta]$


Let $Y_i$ be uniformly distributed on the interval $[0, \theta^*]$ of the real line where the right
end point $\theta^*$ is unknown. The density $p(y, \theta)$ of $P_\theta$ w.r.t. the Lebesgue measure is
$\theta^{-1} 1(y \le \theta)$. The likelihood reads as

$$
Z(\theta) = \theta^{-n}\, 1(\max_i Y_i \le \theta).
$$

This density is positive iff $\theta \ge \max_i Y_i$ and it is maximized exactly for $\tilde\theta = \max_i Y_i$.
One can see that the MLE $\tilde\theta$ is the limiting case of the moment estimate $\tilde\theta_k$ as $k$
grows to infinity.
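
The limiting relation can be observed numerically. The sketch below assumes that the moment estimate has the form $\tilde\theta_k = \{(k + 1)\, n^{-1} \sum Y_i^k\}^{1/k}$, which follows from $E_\theta Y^k = \theta^k / (k + 1)$ for the uniform law; this explicit form, the sample and the function names are assumptions of the example, since $\tilde\theta_k$ is not restated in this section. The powers are rescaled by $\max_i Y_i$ to avoid numerical overflow for large $k$.

```python
def likelihood(theta, sample):
    # Z(theta) = theta^{-n} 1(max_i Y_i <= theta)
    return theta ** (-len(sample)) if max(sample) <= theta else 0.0

def moment_estimate(sample, k):
    # assumed form ((k + 1) * mean(Y^k))^(1/k), computed with Y rescaled by max(Y)
    y_max = max(sample)
    scaled_mean = sum((y / y_max) ** k for y in sample) / len(sample)
    return y_max * ((k + 1) * scaled_mean) ** (1.0 / k)

sample = [0.8, 2.9, 1.7, 3.6, 0.4, 2.2]   # illustrative data
theta_mle = max(sample)

for k in (1, 2, 10, 100, 1000):
    print(k, moment_estimate(sample, k))
print("MLE:", theta_mle)
```

For $k = 1$ the function returns twice the sample mean; as $k$ grows, the values settle near the sample maximum.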


2.9.5 Bernoulli or Binomial Model 


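Let $P_\theta$ be a Bernoulli law for $\theta \in [0, 1]$. The density of $Y_i$ under $P_\theta$ can be written as

$$
p(y, \theta) = \theta^y (1 - \theta)^{1 - y}.
$$

The corresponding log-likelihood reads as

$$
L(\theta) = \sum \{Y_i \log \theta + (1 - Y_i) \log(1 - \theta)\} = S_n \log \frac{\theta}{1 - \theta} + n \log(1 - \theta)
$$

with $S_n = \sum Y_i$. Maximizing this expression w.r.t. $\theta$ results again in the empirical
mean

$$
\tilde\theta = S_n / n.
$$

This implies

$$
L(\tilde\theta, \theta) = n \tilde\theta \log \frac{\tilde\theta}{\theta} + n (1 - \tilde\theta) \log \frac{1 - \tilde\theta}{1 - \theta} = n K(\tilde\theta, \theta)
$$

where $K(\theta, \theta') = \theta \log(\theta / \theta') + (1 - \theta) \log\{(1 - \theta) / (1 - \theta')\}$ is the Kullback-Leibler
divergence for the Bernoulli law.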


2.9.6 Multinomial Model 


The multinomial distribution $B_\theta^m$ describes the number of successes in $m$ experiments
when one success has the probability $\theta \in [0, 1]$. This distribution can be
viewed as the sum of $m$ independent Bernoulli variables with the same parameter $\theta$.

One has

$$
B_\theta^m(k) = \binom{m}{k} \theta^k (1 - \theta)^{m - k}, \qquad k = 0, \ldots, m.
$$

Exercise 2.9.4. Check that the ML approach leads to the estimate

$$
\tilde\theta = \frac{1}{mn} \sum Y_i.
$$

Compute $L(\tilde\theta, \theta)$.


2.9.7 Exponential Model 


Let $Y_1, \ldots, Y_n$ be i.i.d. exponential random variables with parameter $\theta^* > 0$. This
means that $Y_i$ are nonnegative and satisfy $P(Y_i > t) = e^{-t/\theta^*}$. The density of
the exponential law w.r.t. the Lebesgue measure is $p(y, \theta^*) = e^{-y/\theta^*} / \theta^*$. The
corresponding log-likelihood can be written as

$$
L(\theta) = -n \log \theta - \sum_{i=1}^{n} Y_i / \theta = -S/\theta - n \log \theta,
$$

where $S = Y_1 + \ldots + Y_n$.
The ML estimating equation yields $S / \theta^2 = n / \theta$ or

$$
\tilde\theta = S / n.
$$

For the fitted log-likelihood $L(\tilde\theta, \theta)$ this gives

$$
L(\tilde\theta, \theta) = -n (1 - \tilde\theta / \theta) - n \log(\tilde\theta / \theta) = n K(\tilde\theta, \theta).
$$

Here once again $K(\theta, \theta') = \theta / \theta' - 1 - \log(\theta / \theta')$ is the Kullback-Leibler divergence
for the exponential law.


2.9.8 Poisson Model 


Let $Y_1, \ldots, Y_n$ be i.i.d. Poisson random variables satisfying $P(Y_i = m) =
|\theta^*|^m e^{-\theta^*} / m!$ for $m = 0, 1, 2, \ldots$. The corresponding log-likelihood can be
written as

$$
L(\theta) = \sum_{i=1}^{n} \log\big(\theta^{Y_i} e^{-\theta} / Y_i!\big) = \log \theta \sum_{i=1}^{n} Y_i - n\theta - \sum_{i=1}^{n} \log(Y_i!) = S \log \theta - n\theta - R,
$$

where $S = Y_1 + \ldots + Y_n$ and $R = \sum_{i=1}^{n} \log(Y_i!)$. Here we use the convention $0! = 1$.
The ML estimating equation immediately yields $S / \theta = n$ or

$$
\tilde\theta = S / n.
$$

For the fitted log-likelihood $L(\tilde\theta, \theta)$ this gives

$$
L(\tilde\theta, \theta) = n \tilde\theta \log(\tilde\theta / \theta) - n (\tilde\theta - \theta) = n K(\tilde\theta, \theta).
$$

Here again $K(\theta, \theta') = \theta \log(\theta / \theta') - (\theta - \theta')$ is the Kullback-Leibler divergence
for the Poisson law.
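
The representation $L(\tilde\theta, \theta) = n K(\tilde\theta, \theta)$ obtained above for the variance, Bernoulli, exponential and Poisson models can be verified numerically. The following sketch evaluates both sides for small made-up samples; the data, the comparison point $\theta$ and the function names are illustrative assumptions.

```python
import math

def mean(values):
    return sum(values) / len(values)

# each entry: (log-likelihood L(theta), divergence K(theta, theta'), MLE, sample)
families = {
    "normal variance": (
        lambda t, ys: -0.5 * len(ys) * math.log(2 * math.pi * t) - sum(y * y for y in ys) / (2 * t),
        lambda t, t2: -0.5 * (math.log(t / t2) + 1 - t / t2),
        lambda ys: mean([y * y for y in ys]),
        [0.4, -1.3, 2.2, -0.7, 1.1],
    ),
    "Bernoulli": (
        lambda t, ys: sum(y * math.log(t) + (1 - y) * math.log(1 - t) for y in ys),
        lambda t, t2: t * math.log(t / t2) + (1 - t) * math.log((1 - t) / (1 - t2)),
        mean,
        [1, 0, 0, 1, 1, 0, 1, 1],
    ),
    "exponential": (
        lambda t, ys: -len(ys) * math.log(t) - sum(ys) / t,
        lambda t, t2: t / t2 - 1 - math.log(t / t2),
        mean,
        [0.7, 2.1, 0.3, 1.6, 1.1],
    ),
    "Poisson": (
        lambda t, ys: sum(y * math.log(t) - t - math.lgamma(y + 1) for y in ys),
        lambda t, t2: t * math.log(t / t2) - (t - t2),
        mean,
        [2, 0, 3, 1, 4, 2],
    ),
}

def excess_and_divergence(name, theta):
    loglik, kl, mle, sample = families[name]
    theta_tilde = mle(sample)
    excess = loglik(theta_tilde, sample) - loglik(theta, sample)
    return excess, len(sample) * kl(theta_tilde, theta)

for name in families:
    print(name, excess_and_divergence(name, theta=0.4))
```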


2.9.9 Shift of a Laplace (Double Exponential) Law 


Let $P_0$ be the symmetric distribution defined by the equations

$$
P_0(|Y_1| > y) = e^{-y/\sigma}, \qquad y \ge 0,
$$

for some given $\sigma > 0$. Equivalently one can say that the absolute value of $Y_1$
is exponential with parameter $\sigma$ under $P_0$. Now define $P_\theta$ by shifting $P_0$ by the
value $\theta$. This means that

$$
P_\theta(|Y_1 - \theta| > y) = e^{-y/\sigma}, \qquad y \ge 0.
$$

The density of $Y_1 - \theta$ under $P_\theta$ is $p(y) = (2\sigma)^{-1} e^{-|y|/\sigma}$. The maximum
likelihood approach leads to maximizing the sum

$$
L(\theta) = -n \log(2\sigma) - \sum |Y_i - \theta| / \sigma,
$$

or equivalently to minimizing the sum $\sum |Y_i - \theta|$:

$$
\tilde\theta = \operatorname*{argmin}_{\theta} \sum |Y_i - \theta|. \qquad (2.24)
$$

This is just the LAD estimate given by the median of the edf:

$$
\tilde\theta = \mathrm{med}(F_n).
$$

Exercise 2.9.5. Show that the median solves the problem (2.24).

Hint: suppose that $n$ is odd. Consider the ordered observations $Y_{(1)} \le Y_{(2)} \le
\ldots \le Y_{(n)}$. Show that the median of $P_n$ is given by $Y_{((n+1)/2)}$. Show that this point
solves (2.24).


2.10 Quasi Maximum Likelihood Approach


Let $Y = (Y_1, \ldots, Y_n)^\top$ be a sample from a marginal distribution $P$. Let also
$(P_\theta, \theta \in \Theta)$ be a given parametric family with the log-likelihood $\ell(y, \theta)$. The
parametric approach is based on the assumption that the underlying distribution $P$
belongs to this family. The quasi maximum likelihood method applies the maximum
likelihood approach for the family $(P_\theta)$ even if the underlying distribution $P$ does
not belong to this family. This leads again to the estimate $\tilde\theta$ that maximizes the
expression $L(\theta) = \sum \ell(Y_i, \theta)$ and is called the quasi MLE. It might happen that
the true distribution belongs to some other parametric family for which one also
can construct the MLE. However, there could be serious reasons for applying the
quasi maximum likelihood approach even in this misspecified case. One of them is
that the properties of the estimate $\tilde\theta$ are essentially determined by the geometrical
structure of the log-likelihood. The use of a parametric family with a nice geometric
structure (such as quadratic or convex functions of the parameter) can seriously
reduce the algorithmic burden and improve the behavior of the method.


2.10.1 LSE as Quasi Likelihood Estimation 


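Consider the model

$$Y_i = \theta^* + \varepsilon_i, \tag{2.25}$$

where $\theta^*$ is the parameter of interest from $\mathbb{R}$, and $\varepsilon_i$ are random errors satisfying
$E\varepsilon_i = 0$. The assumption that $\varepsilon_i$ are i.i.d. normal $N(0, \sigma^2)$ leads to the quasi
log-likelihood

$$L(\theta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - \theta)^2.$$

Maximizing the expression $L(\theta)$ leads to minimizing the sum of squared residuals
$(Y_i - \theta)^2$:

$$\tilde\theta = \operatorname*{argmin}_{\theta} \sum_{i=1}^n (Y_i - \theta)^2 = \frac{1}{n} \sum_{i=1}^n Y_i.$$

This estimate is called a LSE or ordinary least squares estimate (OLSE).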


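Example 2.10.1. Consider the model (2.25) with heterogeneous errors, that is, $\varepsilon_i$
are independent normal with zero mean and variances $\sigma_i^2$. The corresponding
log-likelihood reads

$$L^\circ(\theta) = -\frac{1}{2} \sum_{i=1}^n \Bigl\{ \log(2\pi\sigma_i^2) + \frac{(Y_i - \theta)^2}{\sigma_i^2} \Bigr\}.$$

The MLE $\tilde\theta^\circ$ is

$$\tilde\theta^\circ = \operatorname*{argmax}_{\theta} L^\circ(\theta) = N^{-1} \sum_{i=1}^n Y_i/\sigma_i^2, \qquad N = \sum_{i=1}^n \sigma_i^{-2}.$$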


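We now compare the estimates $\tilde\theta$ and $\tilde\theta^\circ$.

Lemma 2.10.1. The following assertions hold for the estimate $\tilde\theta$:
1. $\tilde\theta$ is unbiased: $E_{\theta^*}\tilde\theta = \theta^*$.

2. The quadratic risk of $\tilde\theta$ is equal to the variance $\operatorname{Var}(\tilde\theta)$ given by

$$R(\tilde\theta, \theta^*) \stackrel{\text{def}}{=} E_{\theta^*}|\tilde\theta - \theta^*|^2 = \operatorname{Var}(\tilde\theta) = n^{-2} \sum_{i=1}^n \sigma_i^2.$$

3. $\tilde\theta$ is not R-efficient unless all $\sigma_i^2$ are equal.

Now we consider the MLE $\tilde\theta^\circ$.

Lemma 2.10.2. The following assertions hold for the estimate $\tilde\theta^\circ$:
1. $\tilde\theta^\circ$ is unbiased: $E_{\theta^*}\tilde\theta^\circ = \theta^*$.
2. The quadratic risk of $\tilde\theta^\circ$ is equal to the variance $\operatorname{Var}(\tilde\theta^\circ)$ given by

$$R(\tilde\theta^\circ, \theta^*) \stackrel{\text{def}}{=} E_{\theta^*}|\tilde\theta^\circ - \theta^*|^2 = \operatorname{Var}(\tilde\theta^\circ) = N^{-2} \sum_{i=1}^n \sigma_i^{-2} = N^{-1}.$$

3. $\tilde\theta^\circ$ is R-efficient.


Exercise 2.10.1. Check the statements of Lemmas 2.10.1 and 2.10.2.
Hint: compute the Fisher information for the model (2.25) using the property of
additivity:

$$F(\theta) = \sum_{i=1}^n F^{(i)}(\theta) = \sum_{i=1}^n \sigma_i^{-2} = N,$$

where $F^{(i)}(\theta)$ is the Fisher information in the marginal model $Y_i = \theta + \varepsilon_i$ with
just one observation $Y_i$. Apply the Cramér-Rao inequality for one observation of
the vector $Y$.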
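
The comparison of the two lemmas can be illustrated by a small simulation. The sketch below is written under stated assumptions: the standard deviations in `sigmas`, the true value `theta_star` and the number of repetitions are hypothetical choices, and only the Python standard library is used.

```python
import random

sigmas = [0.5, 1.0, 2.0, 4.0]            # hypothetical standard deviations sigma_i
n = len(sigmas)
N = sum(s ** -2 for s in sigmas)         # Fisher information N = sum_i sigma_i^{-2}

var_ols = sum(s ** 2 for s in sigmas) / n ** 2   # risk of the LSE (Lemma 2.10.1)
var_mle = 1.0 / N                                # risk of the MLE (Lemma 2.10.2)

def simulate(theta_star, reps, seed=0):
    """Monte Carlo estimate of the quadratic risks of both estimates."""
    rng = random.Random(seed)
    sq_ols = sq_mle = 0.0
    for _ in range(reps):
        ys = [theta_star + rng.gauss(0.0, s) for s in sigmas]
        ols = sum(ys) / n                                     # ordinary mean
        mle = sum(y / s ** 2 for y, s in zip(ys, sigmas)) / N # weighted mean
        sq_ols += (ols - theta_star) ** 2
        sq_mle += (mle - theta_star) ** 2
    return sq_ols / reps, sq_mle / reps

risk_ols, risk_mle = simulate(theta_star=1.0, reps=20000)
print(var_ols, risk_ols)
print(var_mle, risk_mle)
```

For this choice of variances the weighted estimate has a much smaller risk than the ordinary mean, in line with the R-efficiency statement.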


2.10.2 LAD and Robust Estimation as Quasi Likelihood 
Estimation 


Consider again the model (2.25). The classical least squares approach faces serious
problems if the available data $Y$ are contaminated with outliers. The reasons for
contamination could be missing data, typing errors, etc. Unfortunately, even
a single outlier can significantly disturb the sum $L(\theta)$ and thus the estimate $\tilde\theta$.
A typical approach proposed and developed by Huber is to apply another "influence
function" $\psi(Y_i - \theta)$ in the sum $L(\theta)$ in place of the squared residual $|Y_i - \theta|^2$,
leading to the M-estimate

$$\tilde\theta = \operatorname*{argmin}_{\theta} \sum_{i=1}^n \psi(Y_i - \theta). \tag{2.26}$$

A popular $\psi$-function for robust estimation is the absolute value $|Y_i - \theta|$. The
resulting estimate

$$\tilde\theta = \operatorname*{argmin}_{\theta} \sum_{i=1}^n |Y_i - \theta|$$

is called LAD and the solution is the median of the empirical distribution $P_n$.
Another proposal is called the Huber function: it is quadratic in a vicinity of zero
and linear outside:

$$\psi(x) = \begin{cases} x^2 & \text{if } |x| \le t, \\ a|x| + b & \text{otherwise.} \end{cases}$$


Exercise 2.10.2. Show that for each $t > 0$, the coefficients $a = a(t)$ and $b = b(t)$
can be selected to provide that $\psi(x)$ and its derivative are continuous.


A remarkable fact about this approach is that every such estimate can be viewed
as a quasi MLE for the model (2.25). Indeed, for a given function $\psi$, define the
measure $P_\theta$ with the log-density $\ell(y, \theta) = -\psi(y - \theta)$. Then the log-likelihood is
$L(\theta) = -\sum \psi(Y_i - \theta)$ and the corresponding (quasi) MLE coincides with (2.26).


Exercise 2.10.3. Suggest a $\sigma$-finite measure $\mu$ such that $\exp\{-\psi(y - \theta)\}$ is the
density of $Y_i$ for the model (2.25) w.r.t. the measure $\mu$.
Hint: suppose for simplicity that

$$C_\psi \stackrel{\text{def}}{=} \int \exp\{-\psi(x)\}\, dx < \infty.$$

Show that $C_\psi^{-1} \exp\{-\psi(y - \theta)\}$ is a density w.r.t. the Lebesgue measure for any $\theta$.


Exercise 2.10.4. Show that the LAD $\tilde\theta = \operatorname{argmin}_\theta \sum |Y_i - \theta|$ is the quasi MLE
for the model (2.25) when the errors $\varepsilon_i$ are assumed Laplacian (double exponential)
with density $p(x) = (1/2)e^{-|x|}$.
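
To see the effect of robust $\psi$-functions on contaminated data, the mean, the median and a Huber-type M-estimate can be compared on a toy sample. The sketch below assumes the choice $a = 2t$, $b = -t^2$ (one way to make the function and its first derivative continuous) and minimizes the convex objective by a plain ternary search; the data and the names `huber` and `m_estimate` are illustrative assumptions.

```python
import statistics

def huber(x, t):
    """Huber function: x^2 for |x| <= t, a|x| + b otherwise, with a = 2t, b = -t^2."""
    a, b = 2.0 * t, -t * t
    return x * x if abs(x) <= t else a * abs(x) + b

def m_estimate(ys, t, iters=200):
    """Minimize sum_i huber(Y_i - theta) over theta by ternary search (convex objective)."""
    lo, hi = min(ys), max(ys)

    def objective(theta):
        return sum(huber(y - theta, t) for y in ys)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

ys = [0.9, 1.2, 1.0, 0.8, 1.1, 25.0]     # five regular points and one gross outlier
mean = statistics.fmean(ys)              # LSE, pulled towards the outlier
med = statistics.median(ys)              # LAD
hub = m_estimate(ys, t=1.0)              # Huber M-estimate
print(mean, med, hub)
```

On this sample the least squares estimate is dominated by the single outlier, while the LAD and Huber estimates stay close to the bulk of the data.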


2.11 Univariate Exponential Families 


Most parametric families considered in the previous sections are particular cases of
exponential family (EF) distributions. This includes the Gaussian shift, Bernoulli,
Poisson, exponential, and volatility models. The notion of an EF already appeared in
the context of the Cramér-Rao inequality. Now we study such families in further
detail.

We say that $\mathcal{P}$ is an exponential family if all measures $P_\theta \in \mathcal{P}$ are dominated by
a $\sigma$-finite measure $\mu_0$ on $\mathcal{Y}$ and the density functions $p(y, \theta) = dP_\theta/d\mu_0(y)$ are of
the form

$$p(y, \theta) \stackrel{\text{def}}{=} \frac{dP_\theta}{d\mu_0}(y) = p(y)\, e^{yC(\theta) - B(\theta)}.$$

Here $C(\theta)$ and $B(\theta)$ are some given nondecreasing functions on $\Theta$ and $p(y)$ is a
nonnegative function on $\mathcal{Y}$.

Usually one assumes some regularity conditions on the family $\mathcal{P}$. One possibility
was already given when we discussed the Cramér-Rao inequality; see
Definition 2.5.1. Below we assume that this condition is always fulfilled. It basically means
that we can differentiate w.r.t. $\theta$ under the integral sign.

For an EF, the log-likelihood admits an especially simple representation, nearly
linear in $y$:

$$\ell(y, \theta) \stackrel{\text{def}}{=} \log p(y, \theta) = yC(\theta) - B(\theta) + \log p(y),$$

so that the log-likelihood ratio for $\theta, \theta' \in \Theta$ reads as

$$\ell(y, \theta, \theta') \stackrel{\text{def}}{=} \ell(y, \theta) - \ell(y, \theta') = y\bigl[C(\theta) - C(\theta')\bigr] - \bigl[B(\theta) - B(\theta')\bigr].$$


2.11.1 Natural Parametrization 


Let $\mathcal{P} = (P_\theta)$ be an EF. By $Y$ we denote one observation from the distribution
$P_\theta \in \mathcal{P}$. In addition to the regularity conditions, one often assumes the natural
parametrization for the family $\mathcal{P}$, which means the relation $E_\theta Y = \theta$. Note that
this relation is fulfilled for all the examples of EFs that we considered so far in
the previous section. It is obvious that the natural parametrization is only possible
if the following identifiability condition is fulfilled: for any two different measures
from the considered parametric family, the corresponding mean values are different.
Apart from this, the natural parametrization is always possible: just define $\theta$ as the
expectation of $Y$. Below we use the abbreviation EFn for an exponential family
with natural parametrization.


2.11.1.1 Some Properties of an EFn 


The natural parametrization implies an important property for the functions $B(\theta)$
and $C(\theta)$.

Lemma 2.11.1. Let $(P_\theta)$ be a naturally parameterized EF. Then

$$B'(\theta) = \theta C'(\theta).$$

Proof. Differentiating both sides of the equation $\int p(y, \theta)\, \mu_0(dy) = 1$ w.r.t. $\theta$
yields

$$0 = \int \{yC'(\theta) - B'(\theta)\}\, p(y, \theta)\, \mu_0(dy) = \int \{yC'(\theta) - B'(\theta)\}\, P_\theta(dy) = \theta C'(\theta) - B'(\theta)$$

and the result follows.


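The next lemma computes the important characteristics of a natural EF such
as the Kullback-Leibler divergence $K(\theta, \theta') = E_\theta \log\bigl(p(Y, \theta)/p(Y, \theta')\bigr)$, the
Fisher information $F(\theta) \stackrel{\text{def}}{=} E_\theta|\ell'(Y, \theta)|^2$, and the rate function $m(\mu, \theta, \theta') =
-\log E_\theta \exp\{\mu\ell(Y, \theta, \theta')\}$.

Lemma 2.11.2. Let $(P_\theta)$ be an EFn. Then with $\theta, \theta' \in \Theta$ fixed, it holds for
* the Kullback-Leibler divergence $K(\theta, \theta') = E_\theta \log\bigl(p(Y, \theta)/p(Y, \theta')\bigr)$:

$$K(\theta, \theta') = \int \log\frac{p(y, \theta)}{p(y, \theta')}\, P_\theta(dy) = \{C(\theta) - C(\theta')\} \int y\, P_\theta(dy) - \{B(\theta) - B(\theta')\} = \theta\{C(\theta) - C(\theta')\} - \{B(\theta) - B(\theta')\}; \tag{2.27}$$

* the Fisher information $F(\theta) \stackrel{\text{def}}{=} E_\theta|\ell'(Y, \theta)|^2$:

$$F(\theta) = C'(\theta);$$

* the rate function $m(\mu, \theta, \theta') = -\log E_\theta \exp\{\mu\ell(Y, \theta, \theta')\}$:

$$m(\mu, \theta, \theta') = K\bigl(\theta, \theta + \mu(\theta' - \theta)\bigr);$$

* the variance $\operatorname{Var}_\theta(Y)$:

$$\operatorname{Var}_\theta(Y) = 1/F(\theta) = 1/C'(\theta). \tag{2.28}$$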
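Proof. Differentiating the equality

$$0 = \int (y - \theta)\, P_\theta(dy) = \int (y - \theta)\, p(y, \theta)\, \mu_0(dy)$$

w.r.t. $\theta$ implies in view of Lemma 2.11.1

$$1 = E_\theta\bigl[(Y - \theta)\{C'(\theta)Y - B'(\theta)\}\bigr] = C'(\theta) E_\theta(Y - \theta)^2.$$

This yields $\operatorname{Var}_\theta(Y) = 1/C'(\theta)$. This leads to the following representation of the
Fisher information:

$$F(\theta) = \operatorname{Var}_\theta[\ell'(Y, \theta)] = \operatorname{Var}_\theta[C'(\theta)Y - B'(\theta)] = [C'(\theta)]^2 \operatorname{Var}_\theta(Y) = C'(\theta).$$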


Exercise 2.11.1. Check the equations for the Kullback-Leibler divergence and
Fisher information from Lemma 2.11.2.
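
The relations $B'(\theta) = \theta C'(\theta)$ and $\operatorname{Var}_\theta(Y) = 1/C'(\theta)$ can be checked numerically for concrete families. The sketch below does this for the Poisson and Bernoulli laws with central finite differences; the dictionary `families`, the helper `deriv` and the test points are assumptions made for the illustration.

```python
import math

# (C, B, variance) for two families with natural parametrization E_theta Y = theta.
# Poisson:   p(y, theta) = p(y) exp{y log(theta) - theta}
# Bernoulli: p(y, theta) = exp{y log(theta / (1 - theta)) + log(1 - theta)}
families = {
    "poisson": (lambda th: math.log(th),
                lambda th: th,
                lambda th: th),
    "bernoulli": (lambda th: math.log(th / (1 - th)),
                  lambda th: -math.log(1 - th),
                  lambda th: th * (1 - th)),
}

def deriv(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def check(name, theta):
    """Return the residuals B'(theta) - theta*C'(theta) and Var(Y) - 1/C'(theta)."""
    C, B, var = families[name]
    c1, b1 = deriv(C, theta), deriv(B, theta)
    return b1 - theta * c1, var(theta) - 1.0 / c1

print(check("poisson", 2.5))
print(check("bernoulli", 0.3))
```

Both residuals should vanish up to the finite-difference error.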


2.11.1.2 MLE and Maximum Likelihood for an EFn


Now we discuss the maximum likelihood estimation for a sample from an EFn. The
log-likelihood can be represented in the form

$$L(\theta) = \sum_{i=1}^n \log p(Y_i, \theta) = C(\theta) \sum_{i=1}^n Y_i - B(\theta) \sum_{i=1}^n 1 + \sum_{i=1}^n \log p(Y_i) = SC(\theta) - nB(\theta) + R, \tag{2.29}$$

where

$$S = \sum_{i=1}^n Y_i, \qquad R = \sum_{i=1}^n \log p(Y_i).$$

The remainder term $R$ is unimportant because it does not depend on $\theta$ and thus it
does not enter in the likelihood ratio. The MLE $\tilde\theta$ is defined by maximizing $L(\theta)$
w.r.t. $\theta$, that is,

$$\tilde\theta = \operatorname*{argmax}_{\theta \in \Theta} L(\theta) = \operatorname*{argmax}_{\theta \in \Theta} \{SC(\theta) - nB(\theta)\}.$$

In the case of an EF with the natural parametrization, this optimization problem
admits a closed form solution given by the next theorem.
Theorem 2.11.1. Let $(P_\theta)$ be an EFn. Then the MLE $\tilde\theta$ fulfills

$$\tilde\theta = S/n = n^{-1} \sum_{i=1}^n Y_i.$$

It holds

$$E_\theta\tilde\theta = \theta, \qquad \operatorname{Var}_\theta(\tilde\theta) = [nF(\theta)]^{-1} = [nC'(\theta)]^{-1},$$

so that $\tilde\theta$ is R-efficient. Moreover, the fitted log-likelihood $L(\tilde\theta, \theta) \stackrel{\text{def}}{=} L(\tilde\theta) - L(\theta)$
satisfies for any $\theta \in \Theta$:

$$L(\tilde\theta, \theta) = nK(\tilde\theta, \theta). \tag{2.30}$$

Proof. Maximization of $L(\theta)$ w.r.t. $\theta$ leads to the estimating equation $nB'(\theta) -
SC'(\theta) = 0$. This and the identity $B'(\theta) = \theta C'(\theta)$ yield the MLE

$$\tilde\theta = S/n.$$

The variance $\operatorname{Var}_\theta(\tilde\theta)$ is computed using (2.28) from Lemma 2.11.2. The
formula (2.27) for the Kullback-Leibler divergence and (2.29) yield the
representation (2.30) for the fitted log-likelihood $L(\tilde\theta, \theta)$ for any $\theta \in \Theta$.


One can see that the estimate $\tilde\theta$ is the mean of the $Y_i$'s. As for the Gaussian
shift model, this estimate can be motivated by the fact that the expectation of every
observation $Y_i$ under $P_\theta$ is just $\theta$ and by the law of large numbers the empirical
mean converges to its expectation as the sample size $n$ grows.
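
As a sanity check of the theorem, one can maximize $SC(\theta) - nB(\theta)$ by brute force and compare the result with $S/n$. The sketch below does this for the Bernoulli family, for which $C(\theta) = \log(\theta/(1-\theta))$ and $B(\theta) = -\log(1-\theta)$; the sample, the grid and the reference point are illustrative assumptions.

```python
import math

def C(th):
    return math.log(th / (1 - th))

def B(th):
    return -math.log(1 - th)

def K(th, th1):
    """Kullback-Leibler divergence from formula (2.27)."""
    return th * (C(th) - C(th1)) - (B(th) - B(th1))

ys = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]       # illustrative Bernoulli sample
n, S = len(ys), sum(ys)

def L(th):
    """Log-likelihood S*C(theta) - n*B(theta); here R = 0 because p(y) = 1."""
    return S * C(th) - n * B(th)

grid = [k / 1000 for k in range(1, 1000)]  # brute-force search on (0, 1)
theta_grid = max(grid, key=L)
theta_mle = S / n                          # closed form from Theorem 2.11.1

theta = 0.35                               # arbitrary reference point (assumption)
fitted = L(theta_mle) - L(theta)
print(theta_grid, theta_mle, fitted, n * K(theta_mle, theta))
```

The grid maximizer coincides with $S/n$ and the fitted log-likelihood matches $nK(\tilde\theta, \theta)$ up to rounding.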


2.11.2 Canonical Parametrization


Another useful representation of an EF is given by the so-called canonical
parametrization. We say that $\upsilon$ is the canonical parameter for this EF if the density
of each measure $P_\upsilon$ w.r.t. the dominating measure $\mu_0$ is of the form:

$$p(y, \upsilon) \stackrel{\text{def}}{=} \frac{dP_\upsilon}{d\mu_0}(y) = p(y) \exp\{y\upsilon - d(\upsilon)\}.$$

Here $d(\upsilon)$ is a given convex function on $\Theta$ and $p(y)$ is a nonnegative function on
$\mathcal{Y}$. The abbreviation EFc will indicate an EF with the canonical parametrization.
2.11.2.1 Some Properties of an EFc


The next relation is an obvious corollary of the definition:


Lemma 2.11.3. An EFn $(P_\theta)$ always permits a unique canonical representation.
The canonical parameter $\upsilon$ is related to the natural parameter $\theta$ by $\upsilon = C(\theta)$,
$d(\upsilon) = B(\theta)$ and $\theta = d'(\upsilon)$.


Proof. The first two relations follow from the definition. They imply $B'(\theta) =
d'(\upsilon) \cdot d\upsilon/d\theta = d'(\upsilon) \cdot C'(\theta)$ and the last statement follows from $B'(\theta) = \theta C'(\theta)$.


The log-likelihood ratio $\ell(y, \upsilon, \upsilon_1)$ for an EFc reads as

$$\ell(y, \upsilon, \upsilon_1) = y(\upsilon - \upsilon_1) - d(\upsilon) + d(\upsilon_1).$$


The next lemma collects some useful facts about an EFc. 


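Lemma 2.11.4. Let $\mathcal{P} = (P_\upsilon, \upsilon \in \mathcal{U})$ be an EFc and let the function $d(\cdot)$ be two
times continuously differentiable. Then it holds for any $\upsilon, \upsilon_1 \in \mathcal{U}$:

(i). The mean $E_\upsilon Y$ and the variance $\operatorname{Var}_\upsilon(Y)$ fulfill

$$E_\upsilon Y = d'(\upsilon), \qquad \operatorname{Var}_\upsilon(Y) = E_\upsilon(Y - E_\upsilon Y)^2 = d''(\upsilon).$$

(ii). The Fisher information $F(\upsilon) \stackrel{\text{def}}{=} E_\upsilon|\ell'(Y, \upsilon)|^2$ satisfies

$$F(\upsilon) = d''(\upsilon).$$

(iii). The Kullback-Leibler divergence $K^c(\upsilon, \upsilon_1) = E_\upsilon \ell(Y, \upsilon, \upsilon_1)$ satisfies

$$K^c(\upsilon, \upsilon_1) = \int \log\frac{p(y, \upsilon)}{p(y, \upsilon_1)}\, P_\upsilon(dy) = d'(\upsilon)(\upsilon - \upsilon_1) - \{d(\upsilon) - d(\upsilon_1)\} = d''(\tilde\upsilon)(\upsilon_1 - \upsilon)^2/2,$$

where $\tilde\upsilon$ is a point between $\upsilon$ and $\upsilon_1$. Moreover, for $\upsilon \le \upsilon_1 \in \mathcal{U}$

$$K^c(\upsilon, \upsilon_1) = \int_{\upsilon}^{\upsilon_1} (\upsilon_1 - u)\, d''(u)\, du.$$

(iv). The rate function $m(\mu, \upsilon_1, \upsilon) \stackrel{\text{def}}{=} -\log E_\upsilon \exp\{\mu\ell(Y, \upsilon_1, \upsilon)\}$ fulfills

$$m(\mu, \upsilon_1, \upsilon) = \mu K^c(\upsilon, \upsilon_1) - K^c\bigl(\upsilon, \upsilon + \mu(\upsilon_1 - \upsilon)\bigr).$$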
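Table 2.1 $\upsilon(\theta)$, $d(\upsilon)$, $F(\upsilon) = d''(\upsilon)$ and $\theta = \theta(\upsilon)$ for the examples from Sect. 2.9

Model                  $\upsilon$                     $d(\upsilon)$                     $F(\upsilon)$                        $\theta(\upsilon)$
Gaussian regression    $\theta/\sigma^2$              $\upsilon^2\sigma^2/2$            $\sigma^2$                           $\sigma^2\upsilon$
Bernoulli model        $\log(\theta/(1-\theta))$      $\log(1 + e^{\upsilon})$          $e^{\upsilon}/(1 + e^{\upsilon})^2$  $e^{\upsilon}/(1 + e^{\upsilon})$
Poisson model          $\log\theta$                   $e^{\upsilon}$                    $e^{\upsilon}$                       $e^{\upsilon}$
Exponential model      $1/\theta$                     $-\log\upsilon$                   $1/\upsilon^2$                       $1/\upsilon$
Volatility model       $-1/(2\theta)$                 $-\frac{1}{2}\log(-2\upsilon)$    $1/(2\upsilon^2)$                    $-1/(2\upsilon)$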


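Proof. Differentiating the equation $\int p(y, \upsilon)\, \mu_0(dy) = 1$ w.r.t. $\upsilon$ yields

$$\int \{y - d'(\upsilon)\}\, p(y, \upsilon)\, \mu_0(dy) = 0,$$

that is, $E_\upsilon Y = d'(\upsilon)$. The expression for the variance can be proved by
differentiating this equation once more. Similarly one can check (ii). The item (iii) can be
checked by simple algebra and (iv) follows from (i).

Further, for any $\upsilon, \upsilon_1 \in \mathcal{U}$, it holds

$$\ell(Y, \upsilon_1, \upsilon) - E_\upsilon \ell(Y, \upsilon_1, \upsilon) = (\upsilon_1 - \upsilon)\{Y - d'(\upsilon)\}$$

and with $u = \mu(\upsilon_1 - \upsilon)$

$$\log E_\upsilon \exp\{u(Y - d'(\upsilon))\} = -ud'(\upsilon) + d(\upsilon + u) - d(\upsilon) + \log E_\upsilon \exp\{uY - d(\upsilon + u) + d(\upsilon)\} = d(\upsilon + u) - d(\upsilon) - ud'(\upsilon) = K^c(\upsilon, \upsilon + u),$$

because

$$E_\upsilon \exp\{uY - d(\upsilon + u) + d(\upsilon)\} = E_\upsilon \frac{dP_{\upsilon + u}}{dP_\upsilon} = 1,$$

and (iv) follows by (iii).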


Table 2.1 presents the canonical parameter and the Fisher information for the 
examples of exponential families from Sect. 2.9. 


Exercise 2.11.2. Check (iii) and (iv) in Lemma 2.11.4.
Exercise 2.11.3. Check the entries of Table 2.1.
Exercise 2.11.4. Check that $K^c(\upsilon, \upsilon') = K(\theta(\upsilon), \theta(\upsilon'))$.


Exercise 2.11.5. Plot $K^c(\upsilon^*, \upsilon)$ as a function of $\upsilon$ for the families from Table 2.1.
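
Items (iii) and (iv) of Lemma 2.11.4 can also be verified numerically for a specific family. The sketch below uses the Poisson law in canonical form, where $d(\upsilon) = e^{\upsilon}$; the integral representation is approximated by a midpoint rule and the expectation in the rate function by a truncated sum over the Poisson probabilities. All function names and the chosen values of $\upsilon$, $\upsilon_1$, $\mu$ are assumptions for the illustration.

```python
import math

def kl_c(v, v1):
    """K^c(v, v1) = d'(v)(v - v1) - {d(v) - d(v1)} with d(v) = d'(v) = exp(v)."""
    return math.exp(v) * (v - v1) - (math.exp(v) - math.exp(v1))

def kl_integral(v, v1, steps=20000):
    """Midpoint rule for the integral of (v1 - u) d''(u) over u in [v, v1], v <= v1."""
    h = (v1 - v) / steps
    total = 0.0
    for k in range(steps):
        u = v + (k + 0.5) * h
        total += (v1 - u) * math.exp(u)
    return total * h

def rate_direct(mu, v1, v, terms=200):
    """-log E_v exp{mu * l(Y, v1, v)} computed by summing over the Poisson pmf."""
    lam = math.exp(v)
    total = 0.0
    for y in range(terms):
        log_pmf = y * v - lam - math.lgamma(y + 1)
        llr = y * (v1 - v) - (math.exp(v1) - math.exp(v))
        total += math.exp(log_pmf + mu * llr)
    return -math.log(total)

def rate_formula(mu, v1, v):
    """Right-hand side of item (iv)."""
    return mu * kl_c(v, v1) - kl_c(v, v + mu * (v1 - v))

v, v1, mu = 0.3, 1.0, 0.5
print(kl_c(v, v1), kl_integral(v, v1))
print(rate_direct(mu, v1, v), rate_formula(mu, v1, v))
```

Each printed pair should agree up to the numerical error of the quadrature and of the truncated sum.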
2.11.2.2 Maximum Likelihood Estimation for an EFc


The structure of the log-likelihood in the case of the canonical parametrization is
particularly simple:

$$L(\upsilon) = \sum_{i=1}^n \log p(Y_i, \upsilon) = \upsilon \sum_{i=1}^n Y_i - d(\upsilon) \sum_{i=1}^n 1 + \sum_{i=1}^n \log p(Y_i) = S\upsilon - nd(\upsilon) + R,$$

where

$$S = \sum_{i=1}^n Y_i, \qquad R = \sum_{i=1}^n \log p(Y_i).$$

Again, as in the case of an EFn, we can ignore the remainder term $R$. The estimating
equation $dL(\upsilon)/d\upsilon = 0$ for the maximum likelihood estimate $\tilde\upsilon$ reads as

$$d'(\upsilon) = S/n.$$

This and the relation $\theta = d'(\upsilon)$ lead to the following result.


Theorem 2.11.2. The MLEs $\tilde\theta$ and $\tilde\upsilon$ for the natural and canonical parametrization
are related by the equations

$$\tilde\theta = d'(\tilde\upsilon), \qquad \tilde\upsilon = C(\tilde\theta).$$

The next result describes the structure of the fitted log-likelihood and basically
repeats the result of Theorem 2.11.1.

Theorem 2.11.3. Let $(P_\upsilon)$ be an EF with canonical parametrization. Then for any
$\upsilon \in \mathcal{U}$ the fitted log-likelihood $L(\tilde\upsilon, \upsilon) \stackrel{\text{def}}{=} \max_{\upsilon'} L(\upsilon', \upsilon)$ satisfies

$$L(\tilde\upsilon, \upsilon) = nK^c(\tilde\upsilon, \upsilon).$$

Exercise 2.11.6. Check the statement of Theorem 2.11.3.
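
When $d'$ has no explicit inverse, the estimating equation $d'(\upsilon) = S/n$ can be solved numerically. The sketch below applies Newton's method to the Bernoulli model, where $d(\upsilon) = \log(1 + e^{\upsilon})$ and the solution is known to be $C(\tilde\theta) = \log(\tilde\theta/(1-\tilde\theta))$, so the result can be compared with Theorem 2.11.2. The starting point, the tolerance and the data are assumptions; plain Newton iterations may fail for sample means very close to 0 or 1.

```python
import math

def d1(v):
    """d'(v) for d(v) = log(1 + e^v): the mean of Y under P_v."""
    return 1.0 / (1.0 + math.exp(-v))

def d2(v):
    """d''(v): the Fisher information F(v)."""
    p = d1(v)
    return p * (1.0 - p)

def canonical_mle(mean, v0=0.0, tol=1e-12, max_iter=100):
    """Solve d'(v) = mean by Newton's method started at v0."""
    v = v0
    for _ in range(max_iter):
        step = (d1(v) - mean) / d2(v)
        v -= step
        if abs(step) < tol:
            break
    return v

ys = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]       # illustrative Bernoulli sample
mean = sum(ys) / len(ys)                  # natural-parameter MLE S / n
v_mle = canonical_mle(mean)
print(v_mle, math.log(mean / (1 - mean)))
```

Both printed numbers are the canonical MLE: one obtained from the estimating equation, the other from the relation $\tilde\upsilon = C(\tilde\theta)$.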


2.11.3 Deviation Probabilities for the Maximum Likelihood 


Let $Y_1, \ldots, Y_n$ be i.i.d. observations from an EF $\mathcal{P}$. This section presents a
probability bound for the fitted likelihood. To be more specific we assume that $\mathcal{P}$ is
canonically parameterized, $\mathcal{P} = (P_\upsilon)$. However, the bound applies to the natural and
any other parametrization because the value of the maximum of the likelihood process
$L(\theta)$ does not depend on the choice of parametrization. The log-likelihood ratio
$L(\upsilon', \upsilon)$ is given by the expression (2.29) and its maximum over $\upsilon'$ leads to the
fitted log-likelihood $L(\tilde\upsilon, \upsilon) = nK^c(\tilde\upsilon, \upsilon)$.

Our first result concerns a deviation bound for $L(\tilde\upsilon, \upsilon)$. It utilizes the
representation for the fitted log-likelihood given by Theorem 2.11.1. As usual, we assume that
the family $\mathcal{P}$ is regular. In addition, we require the following condition.


(Pc) $\mathcal{P} = (P_\upsilon, \upsilon \in \mathcal{U} \subseteq \mathbb{R})$ is a regular EF. The parameter set $\mathcal{U}$ is convex. The
function $d(\upsilon)$ is two times continuously differentiable and the Fisher information
$F(\upsilon) = d''(\upsilon)$ satisfies $F(\upsilon) > 0$ for all $\upsilon$.


The condition (Pc) implies that for any compact set $\mathcal{U}_0$ there is a constant $a =
a(\mathcal{U}_0) > 0$ such that

$$|F(\upsilon_1)/F(\upsilon_2)|^{1/2} \le a, \qquad \upsilon_1, \upsilon_2 \in \mathcal{U}_0.$$


Theorem 2.11.4. Let $Y_i$ be i.i.d. from a distribution $P_{\upsilon^*}$ which belongs to an EFc
satisfying (Pc). For any $\mathfrak{z} > 0$

$$P_{\upsilon^*}\bigl(L(\tilde\upsilon, \upsilon^*) > \mathfrak{z}\bigr) = P_{\upsilon^*}\bigl(nK^c(\tilde\upsilon, \upsilon^*) > \mathfrak{z}\bigr) \le 2e^{-\mathfrak{z}}.$$


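Proof. The proof is based on two properties of the log-likelihood. The first one is
that the expectation of the likelihood ratio is just one: $E_{\upsilon^*} \exp L(\upsilon, \upsilon^*) = 1$. This
and the exponential Markov inequality imply for $\mathfrak{z} > 0$

$$P_{\upsilon^*}\bigl(L(\upsilon, \upsilon^*) > \mathfrak{z}\bigr) \le e^{-\mathfrak{z}}. \tag{2.31}$$

The second property is specific to the considered univariate EF and is based on
geometric properties of the log-likelihood function: linearity in the observations $Y_i$
and convexity in the parameter $\upsilon$. We formulate this important fact in a separate
statement.


Lemma 2.11.5. Let the EFc $\mathcal{P}$ fulfill (Pc). For given $\mathfrak{z}$ and any $\upsilon_0 \in \mathcal{U}$, there exist
two values $\upsilon^+ > \upsilon_0$ and $\upsilon^- < \upsilon_0$ satisfying $K^c(\upsilon^\pm, \upsilon_0) = \mathfrak{z}/n$ such that

$$\{L(\tilde\upsilon, \upsilon_0) > \mathfrak{z}\} \subseteq \{L(\upsilon^+, \upsilon_0) > \mathfrak{z}\} \cup \{L(\upsilon^-, \upsilon_0) > \mathfrak{z}\}.$$

Proof. It holds

$$\{L(\tilde\upsilon, \upsilon_0) > \mathfrak{z}\} = \Bigl\{ \sup_{\upsilon} \bigl[ S(\upsilon - \upsilon_0) - n\{d(\upsilon) - d(\upsilon_0)\} \bigr] > \mathfrak{z} \Bigr\}$$

$$\subseteq \Bigl\{ S > \inf_{\upsilon > \upsilon_0} \frac{\mathfrak{z} + n\{d(\upsilon) - d(\upsilon_0)\}}{\upsilon - \upsilon_0} \Bigr\} \cup \Bigl\{ -S > \inf_{\upsilon < \upsilon_0} \frac{\mathfrak{z} + n\{d(\upsilon) - d(\upsilon_0)\}}{\upsilon_0 - \upsilon} \Bigr\}.$$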
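Define for every $u > 0$

$$f(u) = \frac{\mathfrak{z} + n\{d(\upsilon_0 + u) - d(\upsilon_0)\}}{u}.$$

This function attains its minimum at a point $u$ satisfying the equation

$$\mathfrak{z}/n + d(\upsilon_0 + u) - d(\upsilon_0) - d'(\upsilon_0 + u)u = 0$$

or, equivalently,

$$K(\upsilon_0 + u, \upsilon_0) = \mathfrak{z}/n.$$

The condition (Pc) provides that there is only one solution $u > 0$ of this equation.


Exercise 2.11.7. Check that the equation $K(\upsilon_0 + u, \upsilon_0) = \mathfrak{z}/n$ has only one positive
solution for any $\mathfrak{z} > 0$.
Hint: use that $K(\upsilon_0 + u, \upsilon_0)$ is a convex function of $u$ with minimum at $u = 0$.


Now, it holds with $\upsilon^+ = \upsilon_0 + u$

$$\Bigl\{ S > \inf_{\upsilon > \upsilon_0} \frac{\mathfrak{z} + n\{d(\upsilon) - d(\upsilon_0)\}}{\upsilon - \upsilon_0} \Bigr\} = \Bigl\{ S > \frac{\mathfrak{z} + n\{d(\upsilon^+) - d(\upsilon_0)\}}{\upsilon^+ - \upsilon_0} \Bigr\} \subseteq \{L(\upsilon^+, \upsilon_0) > \mathfrak{z}\}.$$

Similarly

$$\Bigl\{ -S > \inf_{\upsilon < \upsilon_0} \frac{\mathfrak{z} + n\{d(\upsilon) - d(\upsilon_0)\}}{\upsilon_0 - \upsilon} \Bigr\} = \Bigl\{ -S > \frac{\mathfrak{z} + n\{d(\upsilon^-) - d(\upsilon_0)\}}{\upsilon_0 - \upsilon^-} \Bigr\} \subseteq \{L(\upsilon^-, \upsilon_0) > \mathfrak{z}\}$$

for some $\upsilon^- < \upsilon_0$.


The assertion of the theorem is now easy to obtain. Indeed,

$$P_{\upsilon^*}\bigl(L(\tilde\upsilon, \upsilon^*) > \mathfrak{z}\bigr) \le P_{\upsilon^*}\bigl(L(\upsilon^+, \upsilon^*) > \mathfrak{z}\bigr) + P_{\upsilon^*}\bigl(L(\upsilon^-, \upsilon^*) > \mathfrak{z}\bigr) \le 2e^{-\mathfrak{z}},$$

yielding the result.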
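
The bound of Theorem 2.11.4 is non-asymptotic, so it can be checked exactly for a family in which the distribution of $S$ is known. The sketch below does this for the Poisson model, written in the natural parameter $\theta = e^{\upsilon}$: here $S$ is Poisson with parameter $n\theta^*$ and the fitted log-likelihood equals $nK(S/n, \theta^*)$. The values of `n`, `theta_star` and the levels `z`, as well as the truncation point `s_max`, are assumptions for the illustration.

```python
import math

def poisson_kl(theta, theta1):
    """K(theta, theta1) for Poisson laws; the value at theta = 0 is the limit theta1."""
    if theta == 0.0:
        return theta1
    return theta * math.log(theta / theta1) - (theta - theta1)

def tail_probability(n, theta_star, z, s_max=2000):
    """Exact P(n K(S/n, theta_star) > z) for S ~ Poisson(n * theta_star), truncated at s_max."""
    lam = n * theta_star
    prob = 0.0
    for s in range(s_max):
        if n * poisson_kl(s / n, theta_star) > z:
            prob += math.exp(s * math.log(lam) - lam - math.lgamma(s + 1))
    return prob

results = {z: tail_probability(n=20, theta_star=1.5, z=z) for z in (0.5, 1.0, 2.0, 4.0)}
for z, p in results.items():
    print(z, p, 2.0 * math.exp(-z))      # exact tail probability versus the bound
```

In each row the exact probability should not exceed the bound $2e^{-\mathfrak{z}}$; the bound is typically far from sharp for moderate $\mathfrak{z}$.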
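Exercise 2.11.8. Let $(P_\upsilon)$ be a Gaussian shift experiment, that is, $P_\upsilon = N(\upsilon, 1)$.
* Check that $L(\tilde\upsilon, \upsilon) = n|\tilde\upsilon - \upsilon|^2/2$;
* Given $\mathfrak{z} \ge 0$, find the points $\upsilon^+$ and $\upsilon^-$ such that

$$\{L(\tilde\upsilon, \upsilon^*) > \mathfrak{z}\} \subseteq \{L(\upsilon^+, \upsilon^*) > \mathfrak{z}\} \cup \{L(\upsilon^-, \upsilon^*) > \mathfrak{z}\}.$$

* Plot the mentioned sets $\{\upsilon : L(\tilde\upsilon, \upsilon) > \mathfrak{z}\}$, $\{\upsilon : L(\upsilon^+, \upsilon) > \mathfrak{z}\}$, and $\{\upsilon :
L(\upsilon^-, \upsilon) > \mathfrak{z}\}$ as functions of $\upsilon$ for a fixed $S = \sum Y_i$.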


Remark 2.11.1. Note that the mentioned result only utilizes the geometric structure
of the univariate EFc. The most important feature of the log-likelihood ratio
$L(\upsilon, \upsilon^*) = S(\upsilon - \upsilon^*) - n\{d(\upsilon) - d(\upsilon^*)\}$ is its linearity w.r.t. the stochastic term
$S$. This allows us to replace the maximum over the whole set $\mathcal{U}$ by the maximum
over the set consisting of two points $\upsilon^\pm$. Note that the proof does not rely on the
distribution of the observations $Y_i$. In particular, Lemma 2.11.5 continues to hold
even within the quasi likelihood approach when $L(\upsilon)$ is not the true log-likelihood.
However, the bound (2.31) relies on the nature of $L(\upsilon, \upsilon^*)$. Namely, it utilizes that
$E\exp\{L(\upsilon^\pm, \upsilon^*)\} = 1$, which is true under $P = P_{\upsilon^*}$ but generally false in the
quasi likelihood setup. Nevertheless, the exponential bound can be extended to the
quasi likelihood approach under the condition of bounded exponential moments for
$L(\upsilon, \upsilon^*)$: for some $\mu > 0$, it should hold $E\exp\{\mu L(\upsilon, \upsilon^*)\} = C(\mu) < \infty$.


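Theorem 2.11.4 yields a simple construction of a confidence interval for the
parameter $\upsilon^*$ and the concentration property of the MLE $\tilde\upsilon$.


Theorem 2.11.5. Let $Y_i$ be i.i.d. from $P_{\upsilon^*} \in \mathcal{P}$ with $\mathcal{P}$ satisfying (Pc).

1. If $\mathfrak{z}_\alpha$ satisfies $e^{-\mathfrak{z}_\alpha} \le \alpha/2$, then

$$\mathcal{E}(\mathfrak{z}_\alpha) = \{\upsilon : nK^c(\tilde\upsilon, \upsilon) \le \mathfrak{z}_\alpha\}$$

is an $\alpha$-confidence set for the parameter $\upsilon^*$.
2. Define for any $\mathfrak{z} > 0$ the set $\mathcal{A}(\mathfrak{z}, \upsilon^*) = \{\upsilon : K^c(\upsilon, \upsilon^*) \le \mathfrak{z}/n\}$. Then

$$P_{\upsilon^*}\bigl(\tilde\upsilon \notin \mathcal{A}(\mathfrak{z}, \upsilon^*)\bigr) \le 2e^{-\mathfrak{z}}.$$


The second assertion of the theorem claims that the estimate $\tilde\upsilon$ belongs with
a high probability to the vicinity $\mathcal{A}(\mathfrak{z}, \upsilon^*)$ of the central point $\upsilon^*$ defined by the
Kullback-Leibler divergence. Due to Lemma 2.11.4 (iii), $K^c(\upsilon, \upsilon^*) \approx F(\upsilon^*)(\upsilon -
\upsilon^*)^2/2$, where $F(\upsilon^*)$ is the Fisher information at $\upsilon^*$. This vicinity is an interval
around $\upsilon^*$ of length of order $n^{-1/2}$. In other words, this result implies the root-n
consistency of $\tilde\upsilon$.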
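
The confidence set from the first assertion is an interval whose endpoints solve $nK^c(\tilde\upsilon, \upsilon) = \mathfrak{z}_\alpha$. A minimal sketch of its computation for the Poisson model in canonical form ($d(\upsilon) = e^{\upsilon}$) is given below; the bisection helper, the search width and the values of `n`, `mean` and `alpha` are assumptions made for the example.

```python
import math

def kl_c(v, v1):
    """K^c(v, v1) for the Poisson EFc with d(v) = exp(v)."""
    return math.exp(v) * (v - v1) - (math.exp(v) - math.exp(v1))

def bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kl_confidence_interval(v_mle, n, alpha, width=20.0):
    """Endpoints of {v : n K^c(v_mle, v) <= z_alpha} with exp(-z_alpha) = alpha / 2."""
    z = math.log(2.0 / alpha)
    g = lambda v: n * kl_c(v_mle, v) - z      # negative inside the confidence set
    left = bisect(g, v_mle - width, v_mle)
    right = bisect(g, v_mle, v_mle + width)
    return left, right

n, mean = 50, 2.0                  # hypothetical sample size and sample mean
v_mle = math.log(mean)             # canonical MLE solves d'(v) = S / n
lo, hi = kl_confidence_interval(v_mle, n, alpha=0.05)
print(lo, v_mle, hi)
```

The resulting interval is in general not symmetric around $\tilde\upsilon$, in contrast to intervals based on the quadratic approximation of $K^c$.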

The deviation bound for the fitted log-likelihood from Theorem 2.11.4 can be
viewed as a bound for the normalized loss of the estimate $\tilde\upsilon$. Indeed, define the loss
function $\wp(\upsilon', \upsilon) = K^{1/2}(\upsilon', \upsilon)$. Then Theorem 2.11.4 yields that the loss is with
high probability bounded by $\sqrt{\mathfrak{z}/n}$ provided that $\mathfrak{z}$ is sufficiently large. Similarly
one can establish the bound for the risk.


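Theorem 2.11.6. Let $Y_i$ be i.i.d. from the distribution $P_{\upsilon^*}$ which belongs to a
canonically parameterized EF satisfying (Pc). The following properties hold:

(i). For any $r > 0$ there is a constant $\mathfrak{r}_r$ such that

$$E_{\upsilon^*} L^r(\tilde\upsilon, \upsilon^*) = n^r E_{\upsilon^*} K^r(\tilde\upsilon, \upsilon^*) \le \mathfrak{r}_r.$$

(ii). For every $\lambda < 1$

$$E_{\upsilon^*} \exp\{\lambda L(\tilde\upsilon, \upsilon^*)\} = E_{\upsilon^*} \exp\{\lambda nK(\tilde\upsilon, \upsilon^*)\} \le (1 + \lambda)/(1 - \lambda).$$


Proof. By Theorem 2.11.4

$$E_{\upsilon^*} L^r(\tilde\upsilon, \upsilon^*) = -\int_{\mathfrak{z} \ge 0} \mathfrak{z}^r\, dP_{\upsilon^*}\{L(\tilde\upsilon, \upsilon^*) > \mathfrak{z}\} = r \int_{\mathfrak{z} \ge 0} \mathfrak{z}^{r-1} P_{\upsilon^*}\{L(\tilde\upsilon, \upsilon^*) > \mathfrak{z}\}\, d\mathfrak{z} \le r \int_{\mathfrak{z} \ge 0} 2\mathfrak{z}^{r-1} e^{-\mathfrak{z}}\, d\mathfrak{z}$$

and the first assertion is fulfilled with $\mathfrak{r}_r = 2r \int_{\mathfrak{z} \ge 0} \mathfrak{z}^{r-1} e^{-\mathfrak{z}}\, d\mathfrak{z}$. The assertion (ii) is
proved similarly.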
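
The constant in (i) can be written through the Gamma function as $2r\Gamma(r) = 2\Gamma(r+1)$. As an illustration, the sketch below compares it with the exact moments in the Gaussian shift model $N(\upsilon, 1)$, where $L(\tilde\upsilon, \upsilon^*) = Z^2/2$ for a standard normal $Z$ and therefore $E L^r = \Gamma(r + 1/2)/\sqrt{\pi}$. The function names and the chosen values of $r$ are assumptions for the example.

```python
import math

def bound_constant(r):
    """2 r * int_0^inf z^{r-1} e^{-z} dz = 2 r Gamma(r) = 2 Gamma(r + 1)."""
    return 2.0 * r * math.gamma(r)

def gaussian_shift_moment(r):
    """E (Z^2 / 2)^r for Z ~ N(0, 1), which equals Gamma(r + 1/2) / sqrt(pi)."""
    return math.gamma(r + 0.5) / math.sqrt(math.pi)

for r in (0.5, 1.0, 2.0, 3.0):
    print(r, gaussian_shift_moment(r), bound_constant(r))
```

In this model the exact moments stay well below the universal constant, which reflects that the bound is derived from the tail inequality alone.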


2.11.3.1 Deviation Bound for Other Parameterizations 


The results for the maximum likelihood and their corollaries have been stated for
an EFc. An immediate question that arises in this respect is whether the use of the
canonical parametrization is essential. The answer is "no": a similar result can be
stated for any EF, whatever parametrization is used. This fact is based on the
simple observation that the maximum likelihood is the value of the maximum of the
likelihood process; this value does not depend on the parametrization.


Lemma 2.11.6. Let $(P_\theta)$ be an EF. Then for any $\theta$

$$L(\tilde\theta, \theta) = nK(P_{\tilde\theta}, P_\theta). \tag{2.32}$$


Exercise 2.11.9. Check the result of Lemma 2.11.6.
Hint: use that both sides of (2.32) depend only on the measures $P_{\tilde\theta}$, $P_\theta$ and not on the
parametrization.


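Below we write as before $K(\tilde\theta, \theta)$ instead of $K(P_{\tilde\theta}, P_\theta)$. The property (2.32) and
the exponential bound of Theorem 2.11.4 imply the bound for a general EF:


Theorem 2.11.7. Let $(P_\theta)$ be a univariate EF. Then for any $\mathfrak{z} > 0$

$$P_{\theta^*}\bigl(L(\tilde\theta, \theta^*) > \mathfrak{z}\bigr) = P_{\theta^*}\bigl(nK(\tilde\theta, \theta^*) > \mathfrak{z}\bigr) \le 2e^{-\mathfrak{z}}.$$


This result allows us to build confidence sets for the parameter $\theta^*$ and
concentration sets for the MLE $\tilde\theta$ in terms of the Kullback-Leibler divergence:

$$\mathcal{A}(\mathfrak{z}, \theta^*) = \{\theta : K(\theta, \theta^*) \le \mathfrak{z}/n\},$$
$$\mathcal{E}(\mathfrak{z}) = \{\theta : K(\tilde\theta, \theta) \le \mathfrak{z}/n\}.$$


Corollary 2.11.1. Let $(P_\theta)$ be an EF. If $e^{-\mathfrak{z}_\alpha} = \alpha/2$, then

$$P_{\theta^*}\bigl(\tilde\theta \notin \mathcal{A}(\mathfrak{z}_\alpha, \theta^*)\bigr) \le \alpha,$$

and

$$P_{\theta^*}\bigl(\mathcal{E}(\mathfrak{z}_\alpha) \not\ni \theta^*\bigr) \le \alpha.$$

Moreover, for any $r > 0$

$$E_{\theta^*} L^r(\tilde\theta, \theta^*) = n^r E_{\theta^*} K^r(\tilde\theta, \theta^*) \le \mathfrak{r}_r.$$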

2.11.3.2 Asymptotic Against Likelihood-Based Approach 


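The asymptotic approach recommends applying symmetric confidence and
concentration sets with width of order $[nF(\theta^*)]^{-1/2}$:

$$\mathcal{A}_n(\mathfrak{z}, \theta^*) = \{\theta : F(\theta^*)(\theta - \theta^*)^2 \le 2\mathfrak{z}/n\},$$
$$\mathcal{E}_n(\mathfrak{z}) = \{\theta : F(\theta^*)(\theta - \tilde\theta)^2 \le 2\mathfrak{z}/n\},$$
$$\mathcal{E}_n'(\mathfrak{z}) = \{\theta : F(\tilde\theta)(\theta - \tilde\theta)^2 \le 2\mathfrak{z}/n\}.$$

Then asymptotically, i.e. for large $n$, these sets do approximately the same job as the
non-asymptotic sets $\mathcal{A}(\mathfrak{z}, \theta^*)$ and $\mathcal{E}(\mathfrak{z})$. However, the difference for finite samples
can be quite significant. In particular, for some cases, e.g. the Bernoulli or Poisson
families, the sets $\mathcal{A}_n(\mathfrak{z}, \theta^*)$ and $\mathcal{E}_n'(\mathfrak{z})$ may extend beyond the parameter set $\Theta$.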
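
The last remark can be made concrete for the Bernoulli family with a small sample and an estimate close to the boundary. The sketch below computes the plug-in asymptotic set $\mathcal{E}_n'(\mathfrak{z})$ and the likelihood-based set $\mathcal{E}(\mathfrak{z})$ for hypothetical values $n = 20$, $\tilde\theta = 0.05$ and $\mathfrak{z} = \log 40$; the helper names and the numerical safeguards (`eps`, bisection) are assumptions for the illustration.

```python
import math

def bern_kl(th, th1):
    """K(th, th1) for Bernoulli laws with 0 < th < 1 and 0 < th1 < 1."""
    return th * math.log(th / th1) + (1 - th) * math.log((1 - th) / (1 - th1))

def bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kl_set(theta_mle, n, z, eps=1e-12):
    """Endpoints of the likelihood-based set {theta : n K(theta_mle, theta) <= z}."""
    g = lambda th: n * bern_kl(theta_mle, th) - z
    return bisect(g, eps, theta_mle), bisect(g, theta_mle, 1 - eps)

def asymptotic_set(theta_mle, n, z):
    """Endpoints of {theta : F(theta_mle) (theta - theta_mle)^2 <= 2 z / n}."""
    fisher = 1.0 / (theta_mle * (1 - theta_mle))   # Bernoulli Fisher information
    half = math.sqrt(2 * z / (n * fisher))
    return theta_mle - half, theta_mle + half

n, theta_mle = 20, 0.05            # hypothetical: one success in twenty trials
z = math.log(40.0)                 # exp(-z) = alpha / 2 with alpha = 0.05
kl_lo, kl_hi = kl_set(theta_mle, n, z)
as_lo, as_hi = asymptotic_set(theta_mle, n, z)
print(kl_lo, kl_hi)
print(as_lo, as_hi)
```

For these values the symmetric asymptotic interval has a negative left endpoint, whereas the set based on the Kullback-Leibler divergence stays inside $(0, 1)$.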


2.12 Historical Remarks and Further Reading 


The main part of the chapter is inspired by the nice textbook (Borovkov, 1998). The
concept of exponential families is credited to Edwin Pitman, Georges Darmois, and
Bernard Koopman in 1935-1936.

The notion of Kullback-Leibler divergence was originally introduced by
Solomon Kullback and Richard Leibler in 1951 as the directed divergence between
two distributions. Many of its useful properties are studied in the monograph (Kullback,
1997).

The Fisher information was discussed by several early statisticians, notably
Francis Edgeworth. Maximum-likelihood estimation was recommended, analyzed,
and vastly popularized by Ronald Fisher between 1912 and 1922, although it
had been used earlier by Carl Gauss, Pierre-Simon Laplace, Thorvald Thiele, and
Francis Edgeworth.

The Cramér-Rao inequality was independently obtained by Maurice Fréchet,
Calyampudi Rao, and Harald Cramér around 1943-1945.

For further reading we recommend textbooks by Lehmann and Casella (1998),
Borovkov (1998), and Strasser (1985). The deviation bound of Theorem 2.11.4
follows Polzehl and Spokoiny (2006).


Chapter 3 
Regression Estimation 


This chapter discusses the estimation problem for the regression model. First a linear
regression model is considered, then generalized linear modeling is discussed. We
also mention median and quantile regression.


3.1 Regression Model 


The (mean) regression model can be written in the form $E(Y \mid X) = f(X)$, or
equivalently,

$$Y = f(X) + \varepsilon, \tag{3.1}$$

where $Y$ is the dependent (explained) variable and $X$ is the explanatory variable
(regressor) which can be multidimensional. The target of analysis is the systematic
dependence of the explained variable $Y$ on the explanatory variable $X$. The
regression function $f$ describes the dependence of the mean of $Y$ as a function of $X$.
The value $\varepsilon$ can be treated as an individual deviation (error). It is usually assumed
to be random with zero mean. Below we discuss in more detail the components of
the regression model (3.1).


3.1.1 Observations 


In almost all practical situations, regression analysis is performed on the basis of
available data (observations) given in the form of a sample of pairs $(X_i, Y_i)$ for
$i = 1, \ldots, n$, where $n$ is the sample size. Here $Y_1, \ldots, Y_n$ are observed values
of the regression variable $Y$ and $X_1, \ldots, X_n$ are the corresponding values of the
explanatory variable $X$. For each observation $Y_i$, the regression model reads as:

$$Y_i = f(X_i) + \varepsilon_i,$$

where $\varepsilon_i$ is the individual $i$th error.

V. Spokoiny and T. Dickhaus, Basics of Modern Mathematical Statistics,
Springer Texts in Statistics, DOI 10.1007/978-3-642-39909-1_3,
© Springer-Verlag Berlin Heidelberg 2015


3.1.2 Design 


The set $X_1, \ldots, X_n$ of the regressor's values is called a design. The set $\mathcal{X}$ of all
possible values of the regressor $X$ is called the design space. If this set $\mathcal{X}$ is compact,
then one speaks of a compactly supported design.

The nature of the design can be different for different statistical models. However,
it is important to mention that the design is always observable. Two kinds of design
assumptions are usually used in statistical modeling. A deterministic design assumes
that the points $X_1, \ldots, X_n$ are nonrandom and given in advance. Here are typical
examples:

Example 3.1.1 (Time Series). Let $Y_{t_0}, Y_{t_0+1}, \ldots, Y_T$ be a time series. The time
points $t_0, t_0 + 1, \ldots, T$ build a regular deterministic design. The regression function
$f$ explains the trend of the time series $Y_t$ as a function of time.


Example 3.1.2 (Imaging). Let $Y_{ij}$ be the observed gray value at the pixel $(i, j)$ of
an image. The coordinate $X_{ij}$ of this pixel is the corresponding design value. The
regression function $f(X_{ij})$ gives the true image value at $X_{ij}$ which is to be recovered
from the noisy observations $Y_{ij}$.


If the design is supported on a cube in $\mathbb{R}^d$ and the design points $X_i$ form a grid
in this cube, then the design is called equidistant. An important feature of such a
design is that the number $N_A$ of design points in any "massive" subset $A$ of the
unit cube is nearly the volume $V_A$ of this subset multiplied by the sample size $n$:
$N_A \approx nV_A$. Design regularity means that the value $N_A$ is nearly proportional to
$nV_A$, that is, $N_A \approx cnV_A$ for some positive constant $c$ which may depend on the
set $A$.

In some applications, it is natural to assume that the design values $X_i$ are
randomly drawn from some design distribution. Typical examples are given by
sociological studies. In this case one speaks of a random design. The design values
$X_1, \ldots, X_n$ are assumed to be independent and identically distributed from a law
$P_X$ on the design space $\mathcal{X}$ which is a subset of the Euclidean space $\mathbb{R}^d$. The design
variables $X$ are also assumed to be independent of the observations $Y$.

One special case of random design is the uniform design when the design
distribution is uniform on the unit cube in $\mathbb{R}^d$. The uniform design shares an
important property with an equidistant design: the number of design points in
a "massive" subset of the unit cube is on average close to the volume of this set
multiplied by $n$. The random design is called regular on $\mathcal{X}$ if the design distribution
is absolutely continuous with respect to the Lebesgue measure and the design
density $p(x) = dP_X(x)/d\lambda$ is positive and continuous on $\mathcal{X}$. This again ensures
with a probability close to one the regularity property $N_A \approx cnV_A$ with $c = p(x)$
for some $x \in A$.
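
The counting property $N_A \approx nV_A$ is easy to observe in dimension one. The sketch below counts the design points falling into an interval $A = [a, b] \subset [0, 1]$ for an equidistant grid and for a uniform random design; the sample size, the interval and the random seed are assumptions chosen for the illustration.

```python
import random

def count_in(points, a, b):
    """Number N_A of design points in the interval A = [a, b]."""
    return sum(1 for x in points if a <= x <= b)

n = 1000
a, b = 0.2, 0.55                                  # subset A with volume V_A = b - a
grid = [(i + 0.5) / n for i in range(n)]          # equidistant design on [0, 1]
rng = random.Random(1)
uniform = [rng.random() for _ in range(n)]        # uniform random design on [0, 1]

n_grid = count_in(grid, a, b)
n_unif = count_in(uniform, a, b)
expected = n * (b - a)                            # n * V_A
print(n_grid, n_unif, expected)
```

For the grid the count matches $nV_A$ up to rounding at the boundary, while for the random design it fluctuates around $nV_A$ with a standard deviation of order $\sqrt{n}$.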

It is worth mentioning that the case of a random design can be reduced to the
case of a deterministic design by considering the conditional distribution of the data
given the design variables $X_1, \ldots, X_n$.


3.1.3 Errors 


The decomposition of the observed response variable $Y$ into the systematic
component $f(x)$ and the error $\varepsilon$ in the model equation (3.1) is not formally defined and
cannot be done without some assumptions on the errors $\varepsilon_i$. The standard approach
is to assume that the mean value of every $\varepsilon_i$ is zero. Equivalently this means that the
expected value of the observation $Y_i$ is just the regression function $f(X_i)$. This case
is called mean regression or simply regression. It is usually assumed that the errors
$\varepsilon_i$ have finite second moments. The case of homogeneous errors means that all the errors
$\varepsilon_i$ have the same variance $\sigma^2 = \operatorname{Var}\varepsilon_i$. The variance of heterogeneous errors $\varepsilon_i$ may
vary with $i$. In many applications not only the systematic component $f(X_i) = EY_i$
but also the error variance $\operatorname{Var} Y_i = \operatorname{Var}\varepsilon_i$ depends on the regressor (location) $X_i$.
Such models are often written in the form

$$Y_i = f(X_i) + \sigma(X_i)\varepsilon_i.$$

The observation (noise) variance $\sigma^2(x)$ can be the target of analysis similarly to the
mean regression function.

The assumption of zero mean noise, $E\varepsilon_i = 0$, is very natural and has a clear
interpretation. However, in some applications, it can cause trouble, especially if data
are contaminated by outliers. In this case, the assumption of a zero mean can be
replaced by a more robust assumption of a zero median. This leads to the median
regression model which assumes $P(\varepsilon_i \le 0) = 1/2$, or, equivalently

$$P\bigl(Y_i - f(X_i) \le 0\bigr) = 1/2.$$


A further important assumption concerns the joint distribution of the errors $\varepsilon_i$. In
the majority of applications the errors are assumed to be independent. However, in
some situations, the dependence of the errors is quite natural. One example can be
given by time series analysis. The errors $\varepsilon_i$ are defined as the difference between
the observed values $Y_i$ and the trend function $f_i$ at the $i$th time moment. These
errors are often serially correlated and indicate short or long range dependence.
Another example comes from imaging. The neighbor observations in an image are
often correlated due to the imaging technique used for recording the images. The
correlation particularly results from the automatic movement correction.

For theoretical study one often assumes that the errors $\varepsilon_i$ are not only independent
but also identically distributed. This, of course, yields a homogeneous noise. The
theoretical study can be simplified even further if the error distribution is normal.
This case is called Gaussian regression and is denoted as $\varepsilon_i \sim N(0, \sigma^2)$. This
assumption is very useful and greatly simplifies the theoretical study. The main
advantage of Gaussian noise is that the observations and their linear combinations
are also normally distributed. This is an exclusive property of the normal law which
helps to simplify the exposition and avoid technicalities.

Under the given distribution of the errors, the joint distribution of the
observations $Y_i$ is determined by the regression function $f(\cdot)$.


3.1.4 Regression Function 


By Eq. (3.1), the regression variable $Y$ can be decomposed into a systematic
component and a (random) error $\varepsilon$. The systematic component is a deterministic
function $f$ of the explanatory variable $X$ called the regression function. Classical
regression theory considers the case of linear dependence, that is, one fits a linear
relation between $Y$ and $X$:

$$f(x) = a + bx,$$

leading to the model equation

$$Y_i = \theta_1 + \theta_2 X_i + \varepsilon_i.$$


Here $\theta_1$ and $\theta_2$ are the parameters of the linear model. If the regressor $x$ is
multidimensional, then $\theta_2$ is a vector from $\mathbb{R}^d$ and $\theta_2 x$ becomes the scalar product
of two vectors. In many practical examples the assumption of linear dependence is
too restrictive. It can be extended in several ways. One can try a more sophisticated
functional dependence of $Y$ on $X$, for instance polynomial. More generally, one
can assume that the regression function $f$ is known up to a finite-dimensional
parameter $\theta = (\theta_1, \dots, \theta_p)^\top \in \mathbb{R}^p$. This situation is called parametric regression
and denoted by $f(\cdot) = f(\cdot, \theta)$. If the function $f(\cdot, \theta)$ depends on $\theta$ linearly, that
is, $f(x, \theta) = \theta_1 \psi_1(x) + \dots + \theta_p \psi_p(x)$ for some given functions $\psi_1, \dots, \psi_p$,
then the model is called linear regression. An important special case is given
by polynomial regression when $f(x)$ is a polynomial function of degree $p - 1$:
$f(x) = \theta_1 + \theta_2 x + \dots + \theta_p x^{p-1}$.

In many applications a parametric form of the regression function cannot be
justified. Then one speaks of nonparametric regression.


3.2 Method of Substitution and M-Estimation


Observe that the parametric regression equation can be rewritten as

$$\varepsilon_i = Y_i - f(X_i, \theta).$$


If $\tilde\theta$ is an estimate of the parameter $\theta$, then the residuals $\tilde\varepsilon_i = Y_i - f(X_i, \tilde\theta)$ are
estimates of the individual errors $\varepsilon_i$. So the idea of the method is to select the
parameter estimate $\tilde\theta$ in such a way that the empirical distribution $P_n$ of the residuals $\tilde\varepsilon_i$
mimics as well as possible certain prescribed features of the error distribution. We
consider one approach called minimum contrast or M-estimation. Let $\psi(y)$ be an
influence or contrast function. The main condition on the choice of this function is
that

$$E\psi(\varepsilon_i + z) \ge E\psi(\varepsilon_i)$$

for all $i = 1, \dots, n$ and all $z$. Then the true value $\theta^*$ clearly minimizes the
expectation of the sum $\sum_i \psi(Y_i - f(X_i, \theta))$:

$$\theta^* = \operatorname*{argmin}_{\theta} E \sum_i \psi\bigl(Y_i - f(X_i, \theta)\bigr).$$

This leads to the M-estimate

$$\tilde\theta = \operatorname*{argmin}_{\theta \in \Theta} \sum_i \psi\bigl(Y_i - f(X_i, \theta)\bigr).$$

This estimation method can be treated as replacing the true expectation of the errors
by the empirical distribution of the residuals.

We specify this approach for regression estimation by the classical examples of
least squares, least absolute deviation (LAD) and maximum likelihood estimation,
corresponding to $\psi(x) = x^2$, $\psi(x) = |x|$ and $\psi(x) = -\log p(x)$, where $p(x)$ is
the error density. All these examples fall within the framework of M-estimation and
the quasi maximum likelihood approach.


3.2.1 Mean Regression: Least Squares Estimate 


The observations $Y_i$ are assumed to follow the model

$$Y_i = f(X_i, \theta^*) + \varepsilon_i, \qquad E\varepsilon_i = 0, \tag{3.2}$$

with an unknown target $\theta^*$. Suppose in addition that $\sigma_i^2 = E\varepsilon_i^2 < \infty$. Then for
every $\theta \in \Theta$ and every $i \le n$, due to (3.2),

$$E_{\theta^*}\{Y_i - f(X_i, \theta)\}^2 = E_{\theta^*}\{\varepsilon_i + f(X_i, \theta^*) - f(X_i, \theta)\}^2 = \sigma_i^2 + |f(X_i, \theta^*) - f(X_i, \theta)|^2.$$

This yields for the whole sample

$$E_{\theta^*} \sum_i \{Y_i - f(X_i, \theta)\}^2 = \sum_i \bigl\{\sigma_i^2 + |f(X_i, \theta^*) - f(X_i, \theta)|^2\bigr\}.$$

This expression is clearly minimized at $\theta = \theta^*$. This leads to the idea of estimating
the parameter $\theta^*$ by minimizing its empirical counterpart. The resulting estimate is
called the (ordinary) least squares estimate (LSE):

$$\tilde\theta_{\mathrm{LSE}} = \operatorname*{argmin}_{\theta \in \Theta} \sum_i \{Y_i - f(X_i, \theta)\}^2.$$

This estimate is very natural and requires minimal information about the errors $\varepsilon_i$.
Namely, one only needs $E\varepsilon_i = 0$ and $E\varepsilon_i^2 < \infty$.
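As an illustration, the least squares criterion can be minimized in closed form for the simplest linear model $f(x, \theta) = \theta_1 + \theta_2 x$. The following is a minimal sketch in Python; the function names `lse_line` and `rss` and the toy data are hypothetical and serve only to show the criterion at work.

```python
def lse_line(xs, ys):
    """Least squares fit of f(x, theta) = theta1 + theta2 * x (closed form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    theta2 = sxy / sxx                 # slope
    theta1 = y_bar - theta2 * x_bar    # intercept
    return theta1, theta2


def rss(theta, xs, ys):
    """Sum of squared residuals: the empirical least squares criterion."""
    return sum((y - theta[0] - theta[1] * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, roughly following Y = 1 + 2 X plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
theta_lse = lse_line(xs, ys)
```

Any other parameter value, for instance a slightly perturbed intercept or slope, gives a larger value of `rss` on the same data.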


3.2.2 Median Regression: LAD Estimate 


Consider the same regression model as in (3.2), but the errors $\varepsilon_i$ are not zero-mean.
Instead we assume that their median is zero:

$$Y_i = f(X_i, \theta^*) + \varepsilon_i, \qquad \operatorname{med}(\varepsilon_i) = 0.$$

As previously, the target of estimation is the parameter $\theta^*$. Observe that $\varepsilon_i = Y_i - f(X_i, \theta^*)$
and hence, the latter r.v. has median zero. We now use the following
simple fact: if $\operatorname{med}(\varepsilon) = 0$, then for any $z \ne 0$

$$E|\varepsilon + z| \ge E|\varepsilon|. \tag{3.3}$$

Exercise 3.2.1. Prove (3.3).

The property (3.3) implies for every $\theta$

$$E_{\theta^*} \sum_i |Y_i - f(X_i, \theta)| \ge E_{\theta^*} \sum_i |Y_i - f(X_i, \theta^*)|,$$

that is, $\theta^*$ minimizes over $\theta$ the expectation under the true measure of the sum
$\sum_i |Y_i - f(X_i, \theta)|$. This leads to the empirical counterpart of $\theta^*$ given by

$$\tilde\theta = \operatorname*{argmin}_{\theta \in \Theta} \sum_i |Y_i - f(X_i, \theta)|.$$

This procedure is usually referred to as the least absolute deviations (LAD) regression estimate.
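In the simplest case of a constant regression function $f(x, \theta) \equiv \theta$, the LAD criterion is minimized by a sample median, while the least squares criterion is minimized by the sample mean. The sketch below uses hypothetical toy data with one gross outlier to contrast the two; the helper name `lad_objective` is ours.

```python
import statistics


def lad_objective(theta, ys):
    """Sum of absolute deviations for the location model f(x, theta) = theta."""
    return sum(abs(y - theta) for y in ys)


# Hypothetical sample: four observations near 1 and one gross outlier.
ys = [1.0, 1.2, 0.9, 1.1, 50.0]

theta_lad = statistics.median(ys)  # a minimizer of the LAD criterion
theta_lse = statistics.mean(ys)    # the minimizer of the least squares criterion
```

The outlier moves the mean far away from the bulk of the data, whereas the median stays close to it. This is the robustness that motivates median regression.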


3.2.3 Maximum Likelihood Regression Estimation


Let the density function $p(\cdot)$ of the errors $\varepsilon_i$ be known. The regression equation (3.2)
implies $\varepsilon_i = Y_i - f(X_i, \theta^*)$. Therefore, every $Y_i$ has the density $p(y - f(X_i, \theta^*))$.
Independence of the $Y_i$'s implies the product structure of the density of the joint
distribution:

$$\prod_i p\bigl(y_i - f(X_i, \theta^*)\bigr),$$

yielding the log-likelihood

$$L(\theta) = \sum_i \ell\bigl(Y_i - f(X_i, \theta)\bigr)$$

with $\ell(t) = \log p(t)$. The maximum likelihood estimate (MLE) is the point of
maximum of $L(\theta)$:

$$\tilde\theta = \operatorname*{argmax}_{\theta} L(\theta) = \operatorname*{argmax}_{\theta} \sum_i \ell\bigl(Y_i - f(X_i, \theta)\bigr).$$

A closed form solution of this optimization problem exists only in some special cases like linear
Gaussian regression. Otherwise the problem has to be solved numerically.

Consider an important special case corresponding to i.i.d. Gaussian errors,
when $p(y)$ is the density of the normal law with mean zero and variance $\sigma^2$. Then

$$L(\theta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_i |Y_i - f(X_i, \theta)|^2.$$

The corresponding MLE maximizes $L(\theta)$ or, equivalently, minimizes the sum
$\sum_i |Y_i - f(X_i, \theta)|^2$:

$$\tilde\theta = \operatorname*{argmax}_{\theta} L(\theta) = \operatorname*{argmin}_{\theta} \sum_i |Y_i - f(X_i, \theta)|^2. \tag{3.4}$$

This estimate has already been introduced as the ordinary least squares estimate
(oLSE).

An extension of the previous example is given by inhomogeneous Gaussian
regression, when the errors $\varepsilon_i$ are independent Gaussian zero-mean but the variances
depend on $i$: $E\varepsilon_i^2 = \sigma_i^2$. Then the log-likelihood $L(\theta)$ is given by the sum

$$L(\theta) = \sum_i \Bigl\{ -\frac{|Y_i - f(X_i, \theta)|^2}{2\sigma_i^2} - \frac{1}{2} \log(2\pi\sigma_i^2) \Bigr\}.$$

Maximizing this expression w.r.t. $\theta$ is equivalent to minimizing the weighted sum
$\sum_i \sigma_i^{-2} |Y_i - f(X_i, \theta)|^2$:

$$\tilde\theta = \operatorname*{argmin}_{\theta} \sum_i \sigma_i^{-2} |Y_i - f(X_i, \theta)|^2.$$

Such an estimate is also called the weighted least squares estimate (wLSE).
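For a model that is linear in a single parameter, say $f(x, \theta) = \theta x$, the weighted criterion has an explicit minimizer, $\tilde\theta = \sum_i \sigma_i^{-2} X_i Y_i / \sum_i \sigma_i^{-2} X_i^2$. The following sketch (function name and data are hypothetical) shows how a large variance $\sigma_i^2$ downweights the corresponding observation.

```python
def wlse_slope(xs, ys, sigmas):
    """Weighted least squares estimate for f(x, theta) = theta * x.

    Each observation is weighted by the inverse error variance 1 / sigma_i^2.
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    num = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    den = sum(w * x * x for w, x in zip(weights, xs))
    return num / den


# Hypothetical data: the third observation is far off the line Y = 2 X.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 9.0]

theta_equal = wlse_slope(xs, ys, [1.0, 1.0, 1.0])    # ordinary least squares
theta_weighted = wlse_slope(xs, ys, [1.0, 1.0, 10.0])  # noisy third point
```

With equal variances the estimate coincides with the ordinary least squares estimate; declaring the third observation ten times noisier pulls the estimate back towards the slope suggested by the first two points.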

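Another example corresponds to the case when the errors $\varepsilon_i$ are i.i.d. double
exponential, so that $P(\pm\varepsilon_i > t) = e^{-t/\sigma}/2$ for $t \ge 0$ and some given $\sigma > 0$. Then $p(y) =
(2\sigma)^{-1} e^{-|y|/\sigma}$ and

$$L(\theta) = -n \log(2\sigma) - \sigma^{-1} \sum_i |Y_i - f(X_i, \theta)|.$$

The MLE $\tilde\theta$ maximizes $L(\theta)$ or, equivalently, minimizes the sum $\sum_i |Y_i - f(X_i, \theta)|$:

$$\tilde\theta = \operatorname*{argmax}_{\theta} L(\theta) = \operatorname*{argmin}_{\theta \in \Theta} \sum_i |Y_i - f(X_i, \theta)|.$$

So the maximum likelihood regression with Laplacian errors leads back to the LAD
estimate.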


3.2.4 Quasi Maximum Likelihood Approach 


This section very briefly discusses an extension of the maximum likelihood
approach. A more detailed discussion will be given in the context of linear modeling
in Chap. 4. To be specific, consider a regression model

$$Y_i = f(X_i) + \varepsilon_i.$$

The maximum likelihood approach requires specifying the two main ingredients
of this model: a parametric class $\{f(x, \theta), \theta \in \Theta\}$ of regression functions and
the distribution of the errors $\varepsilon_i$. Sometimes such information is lacking. One or
even both modeling assumptions can be misspecified. In such situations one speaks
of a quasi maximum likelihood approach, where the estimate $\tilde\theta$ is defined via
maximizing over $\theta$ the random function $L(\theta)$ even though it is not necessarily the
real log-likelihood. Some examples of this approach have already been given.

Below we distinguish between misspecification of the first and second kind. The
first kind corresponds to the parametric assumption about the regression function:
assumed is the equality $f(X_i) = f(X_i, \theta^*)$ for some $\theta^* \in \Theta$. In reality one
can only expect a reasonable quality of approximating $f(\cdot)$ by $f(\cdot, \theta^*)$. A typical
example is given by linear (polynomial) regression. The linear structure of the
regression function is useful and tractable but it can only be a rough approximation
of the real relation between $Y$ and $X$. The quasi maximum likelihood approach
suggests ignoring this misspecification and proceeding as if the parametric assumption
were fulfilled. This approach raises a number of questions: what is the target of
estimation and what is really estimated by such a quasi ML procedure? In Chap. 4 we
show in the context of linear modeling that the target of estimation can be naturally
defined as the parameter $\theta^*$ providing the best approximation of the true regression
function $f(\cdot)$ by its parametric counterpart $f(\cdot, \theta)$.

The second kind of misspecification concerns the assumption about the errors
$\varepsilon_i$. In most applications, the distribution of the errors is unknown. Moreover,
the errors can be dependent or non-identically distributed. The assumption of a specific
i.i.d. structure leads to a model misspecification and thus to the quasi maximum
likelihood approach. We illustrate this situation by a few examples.

Consider the regression model $Y_i = f(X_i, \theta^*) + \varepsilon_i$ and suppose for a moment
that the errors $\varepsilon_i$ are i.i.d. normal. Then the principal term of the corresponding
log-likelihood is given by the negative sum of the squared residuals, $-\sum_i |Y_i - f(X_i, \theta)|^2$,
and its maximization leads to the least squares method. So one can say that the LSE
method is the quasi MLE when the errors are assumed to be i.i.d. normal. That is,
the LSE can be obtained as the MLE for the imaginary Gaussian regression model
even when the errors $\varepsilon_i$ are not necessarily i.i.d. Gaussian.

If the data are contaminated or the errors have heavy tails, it could be unwise
to apply the LSE method. The LAD method is known to be more robust against
outliers and data contamination. At the same time, it has already been shown in
Sect. 3.2.3 that the LAD estimate is the MLE when the errors are Laplacian (double
exponential). In other words, LAD is the quasi MLE for the model with Laplacian
errors.

Inference for the quasi ML approach is discussed in detail in Chap. 4 in the
context of linear modeling.


3.3 Linear Regression 


One standard way of modeling the regression relationship is based on a linear
expansion of the regression function. This approach is based on the assumption that
the unknown regression function $f(\cdot)$ can be represented as a linear combination of
given basis functions $\psi_1(\cdot), \dots, \psi_p(\cdot)$:

$$f(x) = \theta_1 \psi_1(x) + \dots + \theta_p \psi_p(x).$$

A couple of popular examples are listed in this section. More examples are given
below in Sect. 3.3.1 in the context of projection estimation.

Example 3.3.1 (Multivariate Linear Regression). Let $x = (x_1, \dots, x_d)^\top$ be
$d$-dimensional. The linear regression function $f(x)$ can be written as

$$f(x) = a + b_1 x_1 + \dots + b_d x_d.$$

Here we have $p = d + 1$ and the basis functions are $\psi_1(x) \equiv 1$ and $\psi_m(x) = x_{m-1}$
for $m = 2, \dots, p$. The coefficient $a$ is often called the intercept and $b_1, \dots, b_d$
are the slope coefficients. The vector of coefficients $\theta = (a, b_1, \dots, b_d)^\top$ uniquely
describes the linear relation.


Example 3.3.2 (Polynomial Regression). Let $x$ be univariate and $f(\cdot)$ be a
polynomial function of degree $p - 1$, that is,

$$f(x) = \theta_1 + \theta_2 x + \dots + \theta_p x^{p-1}.$$

Then the basis functions are $\psi_1(x) \equiv 1$, $\psi_2(x) = x$, ..., $\psi_p(x) = x^{p-1}$, while $\theta =
(\theta_1, \dots, \theta_p)^\top$ is the corresponding vector of coefficients.


Exercise 3.3.1. Let the regressor $x$ be $d$-dimensional, $x = (x_1, \dots, x_d)^\top$.
Describe the basis system and the corresponding vector of coefficients for the
case when $f$ is a quadratic function of $x$.


Linear regression is often described using vector-matrix notation. Let $\Psi_i$ be the
vector in $\mathbb{R}^p$ whose entries are the values $\psi_m(X_i)$ of the basis functions at the design
point $X_i$, $m = 1, \dots, p$. Then $f(X_i) = \Psi_i^\top \theta^*$ and the linear regression model can
be written as

$$Y_i = \Psi_i^\top \theta^* + \varepsilon_i, \qquad i = 1, \dots, n.$$

Denote by $Y = (Y_1, \dots, Y_n)^\top$ the vector of observations (responses), and by $\varepsilon =
(\varepsilon_1, \dots, \varepsilon_n)^\top$ the vector of errors. Let finally $\Psi$ be the $p \times n$ matrix with columns
$\Psi_1, \dots, \Psi_n$, that is, $\Psi = \bigl(\psi_m(X_i)\bigr)_{m = 1, \dots, p;\ i = 1, \dots, n}$. Note that each row of $\Psi$ is composed
of the values of the corresponding basis function $\psi_m$ at the design points $X_i$. Now
the regression equation reads as

$$Y = \Psi^\top \theta^* + \varepsilon.$$

The estimation problem for this linear model will be discussed in detail in Chap. 4.
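The matrix notation can be made concrete with a small sketch. The code below builds the $p \times n$ matrix of basis function values for a quadratic polynomial basis and evaluates the noise-free part $\Psi^\top \theta$ of the regression equation. The helper names `design_matrix` and `regression_values`, the design points, and the coefficient vector are hypothetical choices made for illustration.

```python
def design_matrix(basis, design):
    """Return the p x n matrix Psi with entries psi_m(X_i).

    Row m holds the values of the m-th basis function at all design points.
    """
    return [[psi(x) for x in design] for psi in basis]


def regression_values(Psi, theta):
    """Entries of Psi^T theta, that is, f(X_i) = sum_m theta_m * psi_m(X_i)."""
    n = len(Psi[0])
    return [sum(theta[m] * Psi[m][i] for m in range(len(theta)))
            for i in range(n)]


# Polynomial basis 1, x, x^2 (p = 3) on a hypothetical design with n = 4 points.
basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]
design = [0.0, 1.0, 2.0, 3.0]

Psi = design_matrix(basis, design)
theta = [1.0, -2.0, 0.5]
f_values = regression_values(Psi, theta)
```

Adding the error vector to `f_values` would give a sample from the linear model with this coefficient vector.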


3.3.1 Projection Estimation 


Consider a (mean) regression model

$$Y_i = f(X_i) + \varepsilon_i, \qquad i = 1, \dots, n. \tag{3.5}$$

The target of the analysis is the unknown nonparametric function $f$ which has
to be recovered from the noisy data $Y$. This approach is usually considered
within the nonparametric statistical theory because it avoids fixing any parametric
specification of the model function $f$, and thus, of the distribution of the data $Y$.
This section discusses how this nonparametric problem can be put back into the
parametric theory.

The standard way of estimating the regression function $f$ is based on some
smoothness assumption about this function. It enables us to expand the given
function w.r.t. some given functional basis and to evaluate the accuracy of
approximation by finite sums. More precisely, let $\psi_1(x), \dots, \psi_m(x), \dots$ be a given system
of functions. Specific examples are trigonometric (Fourier, cosine), orthogonal
polynomial (Chebyshev, Legendre, Jacobi), and wavelet systems among many
others. The completeness of this system means that a given function $f$ can be
uniquely expanded in the form

$$f(x) = \sum_{m=1}^{\infty} \theta_m \psi_m(x). \tag{3.6}$$

A very desirable feature of the basis system is orthogonality:

$$\int \psi_m(x) \psi_{m'}(x) \, \mu_X(dx) = 0, \qquad m \ne m'.$$

Here $\mu_X$ can be some design measure on the design space or the empirical design measure
$n^{-1} \sum_i \delta_{X_i}$. However, the expansion (3.6) is intractable because it involves infinitely
many coefficients $\theta_m$. A standard procedure is to truncate this expansion after the
first $p$ terms, leading to the finite approximation

$$f(x) \approx \sum_{m=1}^{p} \theta_m \psi_m(x). \tag{3.7}$$

The accuracy of such an approximation becomes better and better as the number $p$ of
terms grows. A smoothness assumption helps to estimate the rate of convergence to
zero of the remainder term $f - \theta_1 \psi_1 - \dots - \theta_p \psi_p$:

$$\| f - \theta_1 \psi_1 - \dots - \theta_p \psi_p \| \le r_p, \tag{3.8}$$

where $r_p$ describes the accuracy of approximation of the function $f$ by the
considered finite sums uniformly over the class of functions with the prescribed
smoothness. The norm used in the definition (3.8) as well as the basis $\{\psi_m\}$ depends
on the particular smoothness class. Popular examples are given by Hölder classes
for the $L_\infty$ norm, Sobolev smoothness for the $L_2$-norm or more generally the $L_s$-norm for
some $s \ge 1$.

The choice of a proper truncation value $p$ is one of the central problems in
nonparametric function estimation. With $p$ growing, the quality of the approximation (3.7)
improves in the sense that $r_p \to 0$ as $p \to \infty$. However, the growth of the
parameter dimension increases the model complexity: one has to estimate
more and more coefficients. Section 4.7 below briefly discusses how the problem
can be formalized and how one can define the optimal choice. However, a rigorous
solution is postponed until the next volume. Here we suppose that the value $p$
is fixed for some reason and apply the quasi maximum likelihood parametric
approach. Namely, the approximation (3.7) is assumed to be an exact equality:
$f(x) = f(x, \theta^*) \stackrel{\mathrm{def}}{=} \theta_1^* \psi_1(x) + \dots + \theta_p^* \psi_p(x)$. Model misspecification, $f(\cdot) \ne
f(\cdot, \theta) = \theta_1 \psi_1 + \dots + \theta_p \psi_p$ for any vector $\theta \in \Theta$, means the modeling error, or
the modeling bias. The parametric approach ignores this modeling error and focuses
on the error within the model, which describes the accuracy of the qMLE $\tilde\theta$.

The qMLE procedure requires specifying the error distribution which appears in
the log-likelihood. In the most general form, let $P_0$ be the joint distribution of the
error vector $\varepsilon$, and let $p^{(n)}(\varepsilon)$ be its density function on $\mathbb{R}^n$. The identities $\varepsilon_i =
Y_i - f(X_i, \theta)$ yield the log-likelihood

$$L(\theta) = \log p^{(n)}\bigl(Y - f(X, \theta)\bigr). \tag{3.9}$$

If the errors $\varepsilon_i$ are i.i.d. with the density $p(y)$, then

$$L(\theta) = \sum_i \log p\bigl(Y_i - f(X_i, \theta)\bigr). \tag{3.10}$$

The most popular least squares method (3.4) implicitly assumes Gaussian homogeneous
noise: the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$. The LAD approach is based on the assumption
of a Laplace error distribution. Categorical data are modeled by a proper exponential
family distribution; see Sect. 3.5. Below we assume that one or another assumption
about the errors is fixed and the log-likelihood is described by (3.9) or (3.10). This
assumption can be misspecified and the qMLE analysis has to be done under the true
error distribution. Some examples of this sort for linear models are given in Sect. 4.6.
In the rest of this section we only discuss how the regression function $f$ in (3.5)
can be approximated by different series expansions. With the selected expansion
and the assumption on the errors, the approximating parametric model is fixed due
to (3.9). In most of the examples we only consider a univariate design with $d = 1$.


3.3.2 Polynomial Approximation 


It is well known that any smooth function $f(\cdot)$ can be approximated by a
polynomial. Moreover, the smoother $f(\cdot)$ is, the better the accuracy of
approximation. The Taylor expansion yields an approximation in the form

$$f(x) \approx \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_m x^m. \tag{3.11}$$

Such an approximation is very natural; however, it is rarely used in statistical
applications. The main reason is that the different power functions $\psi_m(x) = x^m$
are highly correlated with each other. This makes it difficult to identify the
corresponding coefficients. Instead one can use different polynomial systems which
fulfill certain orthogonality conditions.
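The strong correlation between neighboring power functions is easy to observe numerically. The sketch below computes the empirical correlation of $x^4$ and $x^5$ over a hypothetical equispaced design on $[0, 1]$; the function name `correlation` and the choice of design and powers are ours.

```python
import math


def correlation(us, vs):
    """Empirical correlation coefficient of two equally long sequences."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    cov = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    var_u = sum((u - mean_u) ** 2 for u in us)
    var_v = sum((v - mean_v) ** 2 for v in vs)
    return cov / math.sqrt(var_u * var_v)


# Hypothetical design: 101 equispaced points on [0, 1].
grid = [i / 100 for i in range(101)]
corr_45 = correlation([x ** 4 for x in grid], [x ** 5 for x in grid])
```

The resulting value is very close to one, so the two regressors are nearly collinear on this design and their coefficients are hard to separate from noisy data.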

We say that $f(x)$ is a polynomial of degree $m$ if it can be represented in the
form (3.11) with $\theta_m \ne 0$. Any sequence $1, \psi_1(x), \dots, \psi_m(x)$ of such polynomials,
with $\psi_j$ of degree $j$, yields a basis in the vector space of polynomials of degree at most $m$.


Exercise 3.3.2. Let for each $j \le m$ a polynomial $\psi_j$ of degree $j$ be fixed. Then any
polynomial $P_m(x)$ of degree $m$ can be represented in a unique way in the form

$$P_m(x) = c_0 + c_1 \psi_1(x) + \dots + c_m \psi_m(x).$$

Hint: define $c_m = P_m^{(m)} / \psi_m^{(m)}$ and apply induction to $P_m(x) - c_m \psi_m(x)$.


3.3.3 Orthogonal Polynomials


Let $\mu$ be any measure on the real line satisfying the condition

$$\int x^m \mu(dx) < \infty \tag{3.12}$$

for any integer $m$. This enables us to define the scalar product for two polynomial
functions $f, g$ by

$$\langle f, g \rangle \stackrel{\mathrm{def}}{=} \int f(x) g(x) \mu(dx).$$

With such a Hilbert structure we aim to define an orthonormal system
of polynomials $\psi_m$ of degree $m$ for $m = 0, 1, 2, \dots$ such that

$$\langle \psi_j, \psi_m \rangle = \delta_{jm} = \mathbb{1}(j = m), \qquad j, m = 0, 1, 2, \dots.$$

Theorem 3.3.1. Given a measure $\mu$ satisfying the condition (3.12), there exists
a unique orthonormal polynomial system $\psi_0, \psi_1, \psi_2, \dots$. Any polynomial $P_m$ of degree
$m$ can be represented as

$$P_m(x) = a_0 \psi_0(x) + a_1 \psi_1(x) + \dots + a_m \psi_m(x)$$

with

$$a_j = \langle P_m, \psi_j \rangle. \tag{3.13}$$

Proof. We construct the functions $\psi_m$ successively. The function $\psi_0$ is a constant
defined by

$$\psi_0^2 \int \mu(dx) = 1.$$

Suppose now that the orthonormal polynomials $\psi_0, \dots, \psi_{m-1}$ have already been
constructed. Define the coefficients

$$a_j \stackrel{\mathrm{def}}{=} \int x^m \psi_j(x) \mu(dx), \qquad j = 0, 1, \dots, m - 1,$$

and consider the function

$$g_m(x) \stackrel{\mathrm{def}}{=} x^m - a_0 \psi_0 - a_1 \psi_1(x) - \dots - a_{m-1} \psi_{m-1}(x).$$

This is obviously a polynomial of degree $m$. Moreover, by orthonormality of the
$\psi_j$'s, for $j < m$

$$\int g_m(x) \psi_j(x) \mu(dx) = \int x^m \psi_j(x) \mu(dx) - a_j \int \psi_j^2(x) \mu(dx) = 0.$$

So one can define $\psi_m$ by normalization of $g_m$:

$$\psi_m(x) = \langle g_m, g_m \rangle^{-1/2} g_m(x).$$

One can also easily see that the $\psi_m$ defined in this way is the only polynomial of degree $m$ which
is orthogonal to $\psi_j$ for $j < m$ and fulfills $\langle \psi_m, \psi_m \rangle = 1$, because the number of
constraints is equal to the number of coefficients $\theta_0, \dots, \theta_m$ of $\psi_m(x)$.

Let now $P_m$ be a polynomial of degree $m$. Define the coefficients $a_j$ by (3.13).
Similarly to the above, one can show that

$$P_m(x) - \{a_0 \psi_0(x) + a_1 \psi_1(x) + \dots + a_m \psi_m(x)\} \equiv 0,$$

which implies the second claim.
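The construction in the proof is the Gram-Schmidt procedure applied to the monomials $1, x, x^2, \dots$. A minimal numerical sketch follows, under the assumption that $\mu$ is a discrete measure with finitely many atoms (given as `nodes` with masses `weights`), so that the scalar product becomes a finite sum. Polynomials are stored as coefficient lists $[c_0, c_1, \dots]$; all names are hypothetical.

```python
def poly_eval(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def inner(p, q, nodes, weights):
    """Scalar product <p, q> for a discrete measure with the given atoms."""
    return sum(w * poly_eval(p, x) * poly_eval(q, x)
               for x, w in zip(nodes, weights))


def orthonormal_polys(max_degree, nodes, weights):
    """Coefficient lists of psi_0, ..., psi_max_degree (needs > max_degree atoms)."""
    system = []
    for m in range(max_degree + 1):
        monomial = [0.0] * m + [1.0]                  # x^m
        g = monomial[:]
        for psi in system:
            a = inner(monomial, psi, nodes, weights)  # a_j = <x^m, psi_j>
            for k, c in enumerate(psi):
                g[k] -= a * c                         # subtract a_j * psi_j
        norm = inner(g, g, nodes, weights) ** 0.5
        system.append([c / norm for c in g])          # normalize g_m
    return system


# Hypothetical measure: uniform weights on five points of [-1, 1].
nodes = [-1.0, -0.5, 0.0, 0.5, 1.0]
weights = [0.2] * 5
system = orthonormal_polys(3, nodes, weights)
```

For this measure the scalar products `inner(system[j], system[m], nodes, weights)` are numerically equal to $\delta_{jm}$. For larger degrees the plain Gram-Schmidt recursion can lose accuracy in floating point, so this sketch is meant for illustration only.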


Exercise 3.3.3. Let $\{\psi_m\}$ be an orthonormal polynomial system. Show that for any
polynomial $P_j(x)$ of degree $j < m$, it holds

$$\langle P_j, \psi_m \rangle = 0.$$


3.3.3.1 Finite Approximation and the Associated Kernel 


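Let $f$ be a function satisfying

$$\int f^2(x) \mu(dx) < \infty. \tag{3.14}$$

Then the scalar product $a_j = \langle f, \psi_j \rangle$ is well defined for all $j \ge 0$, leading for each
$m \ge 1$ to the following approximation:

$$f_m(x) \stackrel{\mathrm{def}}{=} \sum_{j=0}^{m} a_j \psi_j(x) = \sum_{j=0}^{m} \int f(u) \psi_j(u) \mu(du) \, \psi_j(x) = \int f(u) \Phi_m(x, u) \mu(du) \tag{3.15}$$

with

$$\Phi_m(x, u) = \sum_{j=0}^{m} \psi_j(x) \psi_j(u).$$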


3.3.3.2 Completeness 


The accuracy of approximation of $f$ by $f_m$ with $m$ growing is one of the central
questions in approximation theory. The answer depends on the regularity of the
function $f$ and on the choice of the system $\{\psi_m\}$. Let $\mathcal{F}$ be a linear space of functions $f$
on the real line satisfying (3.14). We say that the basis system $\{\psi_m(x)\}$ is complete
in $\mathcal{F}$ if the identities $\langle f, \psi_m \rangle = 0$ for all $m \ge 0$ imply $f = 0$. As $\psi_m(x)$ is a
polynomial of degree $m$, this definition is equivalent to the condition

$$\langle f, x^m \rangle = 0, \quad m = 0, 1, 2, \dots \iff f = 0.$$


3.3.3.3 Squared Bias and Accuracy of Approximation 


Let $f \in \mathcal{F}$ be a function in $L_2$ satisfying (3.14), and let $\{\psi_m\}$ be a complete basis.
Consider the error $f(x) - f_m(x)$ of the finite approximation $f_m(x)$ from (3.15). The
Parseval identity yields

$$\int f^2(x) \mu(dx) = \sum_{m=0}^{\infty} a_m^2.$$

This yields that the finite sums $\sum_{j=0}^{m} a_j^2$ converge to the infinite sum $\sum_{j=0}^{\infty} a_j^2$
and the remainder $b_m = \sum_{j=m+1}^{\infty} a_j^2$ tends to zero with $m$:

$$b_m \stackrel{\mathrm{def}}{=} \| f - f_m \|^2 = \int |f(x) - f_m(x)|^2 \mu(dx) = \sum_{j=m+1}^{\infty} a_j^2 \to 0$$

as $m \to \infty$. The value $b_m$ is often called the squared bias. Below in this section
we briefly overview some popular polynomial systems used in approximation
theory.


3.3.4 Chebyshev Polynomials 


Chebyshev polynomials are frequently used in approximation theory because of
their very useful features. These polynomials can be defined in many ways: explicit
formulas, recurrence relations, differential equations, among others.


3.3.4.1 A Trigonometric Definition


Chebyshev polynomials are usually defined in the trigonometric form:

$$T_m(x) = \cos\bigl(m \arccos(x)\bigr). \tag{3.16}$$

Exercise 3.3.4. Check that $T_m(x)$ from (3.16) is a polynomial of degree $m$.

Hint: use the formula $\cos((m + 1)u) = 2\cos(u)\cos(mu) - \cos((m - 1)u)$ and
induction arguments.


3.3.4.2 Recurrent Formula


The trigonometric identity $\cos((m + 1)u) = 2\cos(u)\cos(mu) - \cos((m - 1)u)$
yields the recurrence relation between Chebyshev polynomials:

$$T_{m+1}(x) = 2x T_m(x) - T_{m-1}(x), \qquad m \ge 1. \tag{3.17}$$

Exercise 3.3.5. Describe the first 5 polynomials $T_m$.

Hint: use that $T_0(x) \equiv 1$ and $T_1(x) \equiv x$ and apply the recurrent formula (3.17).


3.3.4.3 The Leading Coefficient


The recurrence relation (3.17) and the formulas $T_0(x) \equiv 1$ and $T_1(x) \equiv x$ imply
by induction that the leading coefficient of $T_m(x)$ is equal to $2^{m-1}$ for $m \ge 1$.
Equivalently,

$$T_m^{(m)}(x) \equiv 2^{m-1} m!.$$
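The recurrence (3.17) translates directly into a procedure for the coefficients of $T_m$. The sketch below stores a polynomial as a list of integer coefficients $[c_0, c_1, \dots]$ and compares the result with the trigonometric definition (3.16); the function names are hypothetical.

```python
import math


def chebyshev_coeffs(max_degree):
    """Coefficient lists of T_0, ..., T_max_degree via T_{m+1} = 2x T_m - T_{m-1}."""
    polys = [[1], [0, 1]]                      # T_0 = 1, T_1 = x
    for m in range(1, max_degree):
        prev, cur = polys[m - 1], polys[m]
        nxt = [0] + [2 * c for c in cur]       # coefficients of 2x * T_m(x)
        for k, c in enumerate(prev):
            nxt[k] -= c                        # subtract T_{m-1}(x)
        polys.append(nxt)
    return polys[: max_degree + 1]


def poly_eval(coeffs, x):
    """Evaluate c0 + c1*x + ... by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


polys = chebyshev_coeffs(6)
# Compare with the trigonometric definition at an arbitrary point of [-1, 1].
x0 = 0.3
gap = max(abs(poly_eval(p, x0) - math.cos(m * math.acos(x0)))
          for m, p in enumerate(polys))
```

The last entry of each list is the leading coefficient, which can be checked against the value $2^{m-1}$ stated above.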


3.3.4.4 Orthogonality and Normalization 


Consider the measure $\mu(dx)$ on the open interval $(-1, 1)$ with the density $(1 -
x^2)^{-1/2}$ with respect to the Lebesgue measure. By the change of variables $x =
\cos(u)$ we obtain for all $j \ne m$

$$\int_{-1}^{1} T_m(x) T_j(x) \frac{dx}{\sqrt{1 - x^2}} = \int_0^{\pi} \cos(mu) \cos(ju) \, du = 0.$$

Moreover, for $m \ge 1$

$$\int_{-1}^{1} T_m^2(x) \frac{dx}{\sqrt{1 - x^2}} = \int_0^{\pi} \cos^2(mu) \, du = \frac{1}{2} \int_0^{\pi} \{1 + \cos(2mu)\} \, du = \frac{\pi}{2}.$$

Finally,

$$\int_{-1}^{1} \frac{dx}{\sqrt{1 - x^2}} = \int_0^{\pi} du = \pi.$$

So the orthonormal system can be defined by normalizing the Chebyshev
polynomials $T_m(x)$:

$$\psi_0(x) \equiv \pi^{-1/2}, \qquad \psi_m(x) = \sqrt{2/\pi} \, T_m(x), \quad m \ge 1.$$


3.3.4.5 The Generating Function


The bivariate function

$$f(x, t) = \sum_{m=0}^{\infty} T_m(x) t^m \tag{3.18}$$

is called the generating function. For the Chebyshev polynomials it holds

$$f(x, t) = \frac{1 - tx}{1 - 2tx + t^2}.$$

This fact can be proven by using the recurrent formula (3.17) and the following
relation:

$$f(x, t) = 1 + tx + t \sum_{m=1}^{\infty} T_{m+1}(x) t^m
= 1 + tx + t \sum_{m=1}^{\infty} \{2x T_m(x) - T_{m-1}(x)\} t^m
= 1 + tx + 2tx \{f(x, t) - 1\} - t^2 f(x, t). \tag{3.19}$$


Exercise 3.3.6. Check (3.19) and (3.18). 


3.3.4.6 Roots of $T_m$


The identity $\cos(\pi(k - 1/2)) = 0$ for all integer $k$ yields the roots of the polynomial
$T_m(x)$:

$$x_{k,m} = \cos\Bigl(\frac{\pi(k - 1/2)}{m}\Bigr), \qquad k = 1, \dots, m. \tag{3.20}$$

This means that $T_m(x_{k,m}) = 0$ for $k = 1, \dots, m$ and hence, $T_m(x)$ has exactly $m$
roots on the interval $[-1, 1]$.
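

3.3.4.7 Discrete Orthogonality


Let $x_{1,N}, \dots, x_{N,N}$ be the roots of $T_N$ due to (3.20): $x_{k,N} = \cos\bigl((2k - 1)\pi / (2N)\bigr)$. Define
the discrete inner product

$$\langle T_m, T_j \rangle_N = \sum_{k=1}^{N} T_m(x_{k,N}) T_j(x_{k,N}).$$

Then, similarly to the continuous case, it holds for $0 \le m, j < N$

$$\langle T_m, T_j \rangle_N = \begin{cases} 0, & m \ne j, \\ N/2, & m = j \ne 0, \\ N, & m = j = 0. \end{cases} \tag{3.21}$$

Exercise 3.3.7. Prove (3.21).

Hint: use that for all integers $m$ with $0 < m < 2N$

$$\sum_{k=1}^{N} \cos\Bigl(\frac{(2k - 1) m \pi}{2N}\Bigr) = 0,$$

yielding for all $m' \ne m$

$$\sum_{k=1}^{N} \cos\Bigl(\frac{(2k - 1) m \pi}{2N}\Bigr) \cos\Bigl(\frac{(2k - 1) m' \pi}{2N}\Bigr) = 0.$$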

3.3.4.8 Extremes of $T_m(x)$


Obviously $|T_m(x)| \le 1$ because the cos-function is bounded by one in absolute
value. Moreover, $\cos(k\pi) = (-1)^k$ yields the extreme points $e_k$ with $T_m(e_k) =
(-1)^k$ for

$$e_k = \cos\Bigl(\frac{k\pi}{m}\Bigr), \qquad k = 0, 1, \dots, m. \tag{3.22}$$

In particular, the edge points $x = 1$ and $x = -1$ are extremes of $T_m(x)$.


Exercise 3.3.8. Check that $T_m(e_k) = (-1)^k$ for $e_k$ from (3.22). Show that $T_m(1) =
1$ and $T_m(-1) = (-1)^m$. Show that $|T_m(x)| < 1$ for $x \ne e_k$ on $[-1, 1]$.


Hint: $T_m$ is a polynomial of degree $m$, hence it can have at most $m - 1$ extreme
points inside the interval $(-1, 1)$, which are $e_1, \dots, e_{m-1}$.


3.3.4.9 Sup-Norm 


The important feature of the Chebyshev polynomials which makes them very useful
for approximation theory is that each of them minimizes the sup-norm over all
polynomials of a given degree with a fixed leading coefficient.


Theorem 3.3.2. The scaled Chebyshev polynomial $f_m(x) = 2^{1-m} T_m(x)$ minimizes
the sup-norm $\| f \|_\infty \stackrel{\mathrm{def}}{=} \sup_{x \in [-1, 1]} |f(x)|$ over the class of all polynomials of degree
$m$ with the leading coefficient 1.


Proof. As $|T_m(x)| \le 1$, the sup-norm of $f_m$ fulfills $\| f_m \|_\infty = 2^{1-m}$. Let $w(x)$ be
any other polynomial with the leading coefficient one and $|w(x)| < 2^{1-m}$ on $[-1, 1]$. Consider
the difference $f_m(x) - w(x)$ at the extreme points $e_k$ from (3.22). Then $f_m(e_k) -
w(e_k) > 0$ for all even $k = 0, 2, 4, \dots$ and $f_m(e_k) - w(e_k) < 0$ for all odd $k =
1, 3, 5, \dots$. This means that this difference has at least $m$ roots on $[-1, 1]$, which is
impossible because it is a nonzero polynomial of degree at most $m - 1$.


3.3.4.10 Expansion by Chebyshev Polynomials and Discrete Cosine 
Transform 


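Let $f(x)$ be a measurable function satisfying

$$\int_{-1}^{1} f^2(x) \frac{dx}{\sqrt{1 - x^2}} < \infty.$$

Then this function can be uniquely expanded by Chebyshev polynomials:

$$f(x) = \sum_{m=0}^{\infty} a_m T_m(x).$$

The coefficients $a_m$ in this expansion can be obtained by projection. In view of the
normalization $\langle T_m, T_m \rangle = \pi/2$ for $m \ge 1$, this reads

$$a_m = \frac{\langle f, T_m \rangle}{\langle T_m, T_m \rangle} = \frac{2}{\pi} \int_{-1}^{1} f(x) T_m(x) \frac{dx}{\sqrt{1 - x^2}}, \qquad m \ge 1.$$

However, this method is numerically intensive. Instead, one can use the discrete
orthogonality (3.21). Let some $N$ be fixed and $x_{k,N} = \cos\bigl(\pi(k - 1/2)/N\bigr)$. Then for
$1 \le m < N$

$$a_m \approx \frac{2}{N} \sum_{k=1}^{N} f(x_{k,N}) \cos\Bigl(\frac{\pi m (k - 1/2)}{N}\Bigr),$$

with equality if $f$ is a polynomial of degree less than $N$.
This sum can be computed very efficiently via the discrete cosine transform.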
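The discrete formula can be sketched directly as a finite sum. The code below is a plain $O(N^2)$ implementation rather than a fast cosine transform, and the function name is hypothetical. For the constant term it uses the factor $1/N$ instead of $2/N$, which is what the discrete orthogonality relation gives for $m = 0$; this case is our addition, as the text states the formula only for $m \ge 1$.

```python
import math


def chebyshev_coefficients(f, n_nodes, max_degree):
    """Approximate a_0, ..., a_max_degree of f = sum_m a_m T_m.

    Uses the values of f at the n_nodes roots of T_{n_nodes};
    max_degree is assumed to be smaller than n_nodes.
    """
    nodes = [math.cos(math.pi * (k - 0.5) / n_nodes)
             for k in range(1, n_nodes + 1)]
    values = [f(x) for x in nodes]
    coeffs = []
    for m in range(max_degree + 1):
        s = sum(v * math.cos(math.pi * m * (k - 0.5) / n_nodes)
                for k, v in enumerate(values, start=1))
        factor = 1.0 if m == 0 else 2.0     # normalization N versus N / 2
        coeffs.append(factor * s / n_nodes)
    return coeffs


# Test function: x^3 = (3 T_1(x) + T_3(x)) / 4, a polynomial of degree 3 < N.
coeffs = chebyshev_coefficients(lambda x: x ** 3, n_nodes=8, max_degree=4)
```

Since the test function is a polynomial of degree below the number of nodes, the computed coefficients agree with the exact ones, $a_1 = 3/4$ and $a_3 = 1/4$, up to rounding.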


3.3.5 Legendre Polynomials 


The Legendre polynomials $P_m(x)$ are often used in physics and in harmonic
analysis. They form an orthogonal polynomial system on the interval $[-1, 1]$ w.r.t. the
Lebesgue measure, that is,

$$\int_{-1}^{1} P_m(x) P_{m'}(x) \, dx = 0, \qquad m \ne m'. \tag{3.23}$$

They can also be defined as solutions of the Legendre differential equation

$$\frac{d}{dx} \Bigl[ (1 - x^2) \frac{d}{dx} P_m(x) \Bigr] + m(m + 1) P_m(x) = 0. \tag{3.24}$$

An explicit representation is given by Rodrigues' formula

$$P_m(x) = \frac{1}{2^m m!} \frac{d^m}{dx^m} \bigl[ (x^2 - 1)^m \bigr]. \tag{3.25}$$

Exercise 3.3.9. Check that $P_m(x)$ from (3.25) fulfills (3.24).

Hint: differentiate $m + 1$ times the identity

$$(x^2 - 1) \frac{d}{dx} (x^2 - 1)^m = 2mx (x^2 - 1)^m,$$

yielding by the Leibniz rule

$$m(m + 1) P_m(x) + 2(m + 1) x \frac{d}{dx} P_m(x) + (x^2 - 1) \frac{d^2}{dx^2} P_m(x)
= 2m(m + 1) P_m(x) + 2mx \frac{d}{dx} P_m(x).$$


3.3.5.1 Orthogonality 


The orthogonality property (3.23) can be checked by using Rodrigues' formula.


Exercise 3.3.10. Check that for $m < m'$

$$\int_{-1}^{1} P_m(x) P_{m'}(x) \, dx = 0. \tag{3.26}$$

Hint: integrate (3.26) by parts $m + 1$ times with $P_{m'}$ from (3.25) and use that the
$(m + 1)$th derivative of $P_m$ vanishes.


3.3.5.2 Recursive Definition


It is easy to check that $P_0(x) \equiv 1$ and $P_1(x) \equiv x$. Bonnet's recursion formula
relates three subsequent Legendre polynomials: for $m \ge 1$

$$(m + 1) P_{m+1}(x) = (2m + 1) x P_m(x) - m P_{m-1}(x). \tag{3.27}$$

From Bonnet's recursion formula one obtains by induction the explicit
representation

$$P_m(x) = \sum_{k=0}^{m} (-1)^k \binom{m}{k}^2 \Bigl( \frac{1 + x}{2} \Bigr)^{m-k} \Bigl( \frac{1 - x}{2} \Bigr)^{k}.$$
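Bonnet's recursion gives a simple and stable way of evaluating $P_m(x)$ numerically. The sketch below implements it and, as a cross-check, also evaluates the standard explicit binomial sum for Legendre polynomials, which we assume to be the explicit representation meant in the text. Function names are hypothetical.

```python
import math


def legendre(m, x):
    """Evaluate P_m(x) by Bonnet's recursion, starting from P_0 = 1, P_1 = x."""
    if m == 0:
        return 1.0
    p_prev, p_cur = 1.0, x
    for k in range(1, m):
        # (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
        p_prev, p_cur = p_cur, ((2 * k + 1) * x * p_cur - k * p_prev) / (k + 1)
    return p_cur


def legendre_explicit(m, x):
    """Evaluate P_m(x) by the explicit binomial sum (assumed standard form)."""
    return sum((-1) ** k * math.comb(m, k) ** 2
               * ((1 + x) / 2) ** (m - k) * ((1 - x) / 2) ** k
               for k in range(m + 1))


x0 = 0.5
values = [legendre(m, x0) for m in range(6)]
gap = max(abs(legendre(m, x0) - legendre_explicit(m, x0)) for m in range(6))
```

Both evaluations agree up to rounding errors, and `legendre(m, 1.0)` returns one for every degree, in line with the normalization $P_m(1) = 1$.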


3.3.5.3 More Recursions


Further, the definition (3.25) yields the following 3-term recursion:

$$\frac{x^2 - 1}{m} \frac{d}{dx} P_m(x) = x P_m(x) - P_{m-1}(x). \tag{3.28}$$

Useful for the integration of Legendre polynomials is another recursion:

$$(2m + 1) P_m(x) = \frac{d}{dx} \bigl[ P_{m+1}(x) - P_{m-1}(x) \bigr].$$

Exercise 3.3.11. Check (3.28) by using the definition (3.25).

Hint: use that for $m \ge 2$

$$\frac{d^m}{dx^m} \bigl[ (x^2 - 1)^m \bigr] = 2m \frac{d^{m-1}}{dx^{m-1}} \bigl[ x (x^2 - 1)^{m-1} \bigr]
= 2m \Bigl\{ x \frac{d^{m-1}}{dx^{m-1}} \bigl[ (x^2 - 1)^{m-1} \bigr] + (m - 1) \frac{d^{m-2}}{dx^{m-2}} \bigl[ (x^2 - 1)^{m-1} \bigr] \Bigr\}.$$

Exercise 3.3.12. Check (3.27).

Hint: use that

$$\frac{d^{m+1}}{dx^{m+1}} \bigl[ (x^2 - 1)^{m+1} \bigr] = 2(m + 1) \frac{d^m}{dx^m} \bigl[ x (x^2 - 1)^m \bigr]
= 2(m + 1) \Bigl\{ x \frac{d^m}{dx^m} \bigl[ (x^2 - 1)^m \bigr] + m \frac{d^{m-1}}{dx^{m-1}} \bigl[ (x^2 - 1)^m \bigr] \Bigr\}.$$


3.3.5.4 Generating Function


The Legendre generating function is defined by

$$f(t, x) \stackrel{\mathrm{def}}{=} \sum_{m=0}^{\infty} P_m(x) t^m.$$

It holds

$$f(t, x) = (1 - 2tx + t^2)^{-1/2}. \tag{3.29}$$

Exercise 3.3.13. Check (3.29).


3.3.6 Lagrange Polynomials


In numerical analysis, Lagrange polynomials are used for polynomial interpolation.
The Lagrange polynomials are also widely applied in cryptography, such as in Shamir's
Secret Sharing scheme.

Given a set of $p + 1$ data points $(X_0, Y_0), \dots, (X_p, Y_p)$, where no two $X_j$ are the
same, the interpolation polynomial in the Lagrange form is a linear combination

$$L(x) = \sum_{m=0}^{p} \ell_m(x) Y_m \tag{3.30}$$

of Lagrange basis polynomials

$$\ell_m(x) \stackrel{\mathrm{def}}{=} \prod_{j = 0, \dots, p,\ j \ne m} \frac{x - X_j}{X_m - X_j}
= \frac{x - X_0}{X_m - X_0} \cdots \frac{x - X_{m-1}}{X_m - X_{m-1}} \cdot \frac{x - X_{m+1}}{X_m - X_{m+1}} \cdots \frac{x - X_p}{X_m - X_p}.$$

This definition yields that $\ell_m(X_m) = 1$ and $\ell_m(X_j) = 0$ for $j \ne m$. Hence,
$L(X_m) = Y_m$ for the polynomial $L(x)$ from (3.30). One can easily see that $L(x)$
is the only polynomial of degree at most $p$ that fulfills $L(X_m) = Y_m$ for all $m$.

The main disadvantage of the Lagrange form is that any change of the design
$X_0, \dots, X_p$ requires changing each basis function $\ell_m(x)$. Another problem is that
the Lagrange basis polynomials $\ell_m(x)$ are not necessarily orthogonal. This explains
why these polynomials are rarely used in statistical applications.


3.3.6.1 Barycentric Interpolation


Introduce a polynomial $\ell(x)$ of degree $p + 1$ by

$$\ell(x) = (x - X_0) \cdots (x - X_p).$$

Then the Lagrange basis polynomials can be rewritten as

$$\ell_m(x) = \ell(x) \frac{w_m}{x - X_m}$$

with the barycentric weights $w_m$ defined by

$$w_m = \prod_{j = 0, \dots, p,\ j \ne m} \frac{1}{X_m - X_j},$$

which is commonly referred to as the first form of the barycentric interpolation
formula. The advantage of this representation is that the interpolation polynomial
may now be evaluated as

$$L(x) = \ell(x) \sum_{m=0}^{p} \frac{w_m}{x - X_m} Y_m,$$

which, if the weights $w_m$ have been pre-computed, requires only $O(p)$ operations
(evaluating $\ell(x)$ and the weights $w_m / (x - X_m)$) as opposed to $O(p^2)$ for evaluating
the Lagrange basis polynomials $\ell_m(x)$ individually.

The barycentric interpolation formula can also easily be updated to incorporate 
a new node X,,,, by dividing each of the w,, by (X,, — Xp+1) and constructing the 
new Wp+1 as above. 

We can further simplify the first form by first considering the barycentric
interpolation of the constant function $g(x) \equiv 1$:

$$ g(x) = \ell(x) \sum_{m=0}^{p} \frac{w_m}{x - X_m} . $$

Dividing $L(x)$ by $g(x)$ does not modify the interpolation, yet yields

$$ L(x) = \frac{\sum_{m=0}^{p} w_m Y_m / (x - X_m)}{\sum_{m=0}^{p} w_m / (x - X_m)} , $$

which is referred to as the second form or true form of the barycentric interpolation
formula. This second form has the advantage that $\ell(x)$ need not be evaluated for
each evaluation of $L(x)$.
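
As an illustration, the second barycentric form can be sketched in a few lines of
Python with NumPy. The function names `barycentric_weights` and `barycentric_eval`
and the sample nodes are our own choices for this sketch and are not notation from
the text. The weights are computed once at cost $O(p^2)$; each evaluation then costs $O(p)$.

```python
import numpy as np

def barycentric_weights(nodes):
    """Weights w_m = 1 / prod_{j != m} (X_m - X_j), computed once."""
    nodes = np.asarray(nodes, dtype=float)
    diffs = nodes[:, None] - nodes[None, :]
    np.fill_diagonal(diffs, 1.0)          # leave out the factor j == m
    return 1.0 / diffs.prod(axis=1)

def barycentric_eval(x, nodes, values, weights):
    """Second (true) form of the barycentric formula at a scalar point x."""
    nodes = np.asarray(nodes, dtype=float)
    values = np.asarray(values, dtype=float)
    diff = x - nodes
    hit = diff == 0.0
    if hit.any():                         # x is a node: return Y_m directly
        return float(values[hit][0])
    terms = weights / diff                # w_m / (x - X_m)
    return float(terms @ values / terms.sum())

# Hypothetical example: four nodes carrying values of a cubic polynomial.
nodes = np.array([0.0, 1.0, 2.0, 4.0])
values = nodes**3 - 2.0 * nodes + 1.0
w = barycentric_weights(nodes)
print(barycentric_eval(3.0, nodes, values, w))
```

Since four nodes determine a cubic uniquely, the interpolant in this example
reproduces the polynomial $x^3 - 2x + 1$ at any point.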


3.3.7 Hermite Polynomials 


The Hermite polynomials form an orthogonal system on the whole real line. The
explicit representation is given by

$$ H_m(x) \stackrel{\mathrm{def}}{=} (-1)^m e^{x^2} \frac{d^m}{dx^m} e^{-x^2} . $$

Sometimes one uses a “probabilistic” definition

$$ \tilde{H}_m(x) \stackrel{\mathrm{def}}{=} (-1)^m e^{x^2/2} \frac{d^m}{dx^m} e^{-x^2/2} . $$


Exercise 3.3.14. Show that each $H_m(x)$ and $\tilde{H}_m(x)$ is a polynomial of degree $m$.

Hint: Use induction arguments to show that $\frac{d^m}{dx^m} e^{-x^2}$ can be represented in the form
$P_m(x) e^{-x^2}$ with a polynomial $P_m(x)$ of degree $m$.

Exercise 3.3.15. Check that the leading coefficient of $H_m(x)$ is equal to $2^m$ while
the leading coefficient of $\tilde{H}_m(x)$ is equal to one.


3.3.7.1 Orthogonality 


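The Hermite polynomials are orthogonal on the whole real line with the weight
function $w(x) = e^{-x^2}$: for $j \ne m$

$$ \int_{-\infty}^{\infty} H_m(x) H_j(x) w(x) \, dx = 0. \tag{3.31} $$

Note first that each $H_m(x)$ is a polynomial so the scalar product (3.31) is well
defined. Suppose that $m > j$. It is obvious that it suffices to check that

$$ \int_{-\infty}^{\infty} H_m(x) x^j w(x) \, dx = 0, \qquad j < m. \tag{3.32} $$

Define $f_m(x) \stackrel{\mathrm{def}}{=} \frac{d^m}{dx^m} e^{-x^2}$, so that $H_m(x) w(x) = (-1)^m f_m(x)$. Obviously
$f_{m-1}'(x) = f_m(x)$. Integration by parts yields for any $j \ge 1$

$$ \int_{-\infty}^{\infty} H_m(x) x^j w(x) \, dx
 = (-1)^m \int_{-\infty}^{\infty} x^j f_m(x) \, dx
 = (-1)^m \int_{-\infty}^{\infty} x^j f_{m-1}'(x) \, dx
 = (-1)^{m-1} j \int_{-\infty}^{\infty} x^{j-1} f_{m-1}(x) \, dx . $$

By the same arguments, for $m \ge 1$

$$ \int_{-\infty}^{\infty} f_m(x) \, dx = \int_{-\infty}^{\infty} f_{m-1}'(x) \, dx
 = f_{m-1}(\infty) - f_{m-1}(-\infty) = 0. \tag{3.33} $$

Iterating the integration by parts $j$ times reduces (3.32) to (3.33) with $m - j \ge 1$
in place of $m$. This implies (3.32) and hence the orthogonality property (3.31).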


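Now we compute the squared norm of $H_m(x)$. The integration-by-parts identity above,
applied $m$ times with $j = m$, implies

$$ \int_{-\infty}^{\infty} H_m(x) x^m w(x) \, dx = m! \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi} \, m! . $$

As the leading coefficient of $H_m(x)$ is equal to $2^m$ and $H_m$ is orthogonal to every
polynomial of degree less than $m$, this implies

$$ \int_{-\infty}^{\infty} H_m^2(x) w(x) \, dx = 2^m \int_{-\infty}^{\infty} H_m(x) x^m w(x) \, dx = \sqrt{\pi} \, 2^m m! . $$

Exercise 3.3.16. Prove the orthogonality of the probabilistic Hermite polynomials
$\tilde{H}_m(x)$. Compute their norm.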
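
The orthogonality relation (3.31) and the value $\sqrt{\pi} \, 2^m m!$ of the squared norm can be
checked numerically. The sketch below is only an illustration, assuming NumPy's
`numpy.polynomial.hermite` module, which uses the same convention as $H_m$ here (weight $e^{-x^2}$).
The helper name `hermite_inner` is hypothetical.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermgauss, hermval

def hermite_inner(j, m, n_nodes=20):
    """Approximate int H_j(x) H_m(x) exp(-x^2) dx by Gauss-Hermite quadrature."""
    x, w = hermgauss(n_nodes)        # nodes and weights for the weight exp(-x^2)
    cj = np.zeros(j + 1)
    cj[j] = 1.0                      # coefficient vector that selects H_j
    cm = np.zeros(m + 1)
    cm[m] = 1.0                      # coefficient vector that selects H_m
    return float(np.sum(w * hermval(x, cj) * hermval(x, cm)))

print(hermite_inner(2, 3))                       # close to zero
print(hermite_inner(3, 3))                       # squared norm of H_3
print(math.sqrt(math.pi) * 2**3 * math.factorial(3))
```

The quadrature with 20 nodes is exact (up to rounding) for polynomial integrands
of degree at most 39, which is more than enough for small $j$ and $m$.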


3.3.7.2 Recurrent Formula


The definition yields

$$ H_0(x) \equiv 1, \qquad H_1(x) = 2x . $$

Further, for $m \ge 1$, we use the formula

$$ (-1)^{m+1} e^{x^2} \frac{d^{m+1}}{dx^{m+1}} e^{-x^2}
 = - e^{x^2} \frac{d}{dx} \bigl\{ H_m(x) e^{-x^2} \bigr\}
 = 2x H_m(x) - H_m'(x) \tag{3.34} $$

yielding the recurrent relation

$$ H_{m+1}(x) = 2x H_m(x) - H_m'(x) . $$

Moreover, integration by parts, the formula (3.34), and the orthogonality property
yield for $j < m - 1$

$$ \int_{-\infty}^{\infty} H_m'(x) H_j(x) w(x) \, dx
 = - \int_{-\infty}^{\infty} H_m(x) \bigl\{ H_j(x) w(x) \bigr\}' dx
 = \int_{-\infty}^{\infty} H_m(x) H_{j+1}(x) w(x) \, dx = 0 . $$

This means that $H_m'(x)$ is a polynomial of degree $m - 1$ and it is orthogonal
to all $H_j(x)$ for $j < m - 1$. Thus, $H_m'(x)$ coincides with $H_{m-1}(x)$ up to a
multiplicative factor. The leading coefficient of $H_m'(x)$ is equal to $m 2^m$ while the
leading coefficient of $H_{m-1}(x)$ is equal to $2^{m-1}$ yielding

$$ H_m'(x) = 2m H_{m-1}(x) . $$

This results in another recurrent equation

$$ H_{m+1}(x) = 2x H_m(x) - 2m H_{m-1}(x) . $$

Exercise 3.3.17. Derive the recurrent formulas for the probabilistic Hermite
polynomials $\tilde{H}_m(x)$.
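
The three-term recurrence gives a stable way to evaluate $H_0, \ldots, H_m$ at once. The
following sketch (the function name `hermite_by_recurrence` is our own) implements
$H_{m+1}(x) = 2x H_m(x) - 2m H_{m-1}(x)$ and compares the result with NumPy's `hermval`,
which we assume to follow the same convention for $H_m$.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def hermite_by_recurrence(m_max, x):
    """Return H with H[m] = H_m(x) for m = 0, ..., m_max."""
    x = np.asarray(x, dtype=float)
    H = np.empty((m_max + 1,) + x.shape)
    H[0] = 1.0                                   # H_0(x) = 1
    if m_max >= 1:
        H[1] = 2.0 * x                           # H_1(x) = 2x
    for m in range(1, m_max):
        H[m + 1] = 2.0 * x * H[m] - 2.0 * m * H[m - 1]
    return H

x = np.linspace(-2.0, 2.0, 9)
H = hermite_by_recurrence(5, x)
reference = hermval(x, [0, 0, 0, 0, 0, 1])       # H_5 from NumPy
print(np.max(np.abs(H[5] - reference)))
```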


3.3.7.3 Generating Function 


The exponential generating function $f(x, t)$ for the Hermite polynomials is
defined as

$$ f(x, t) \stackrel{\mathrm{def}}{=} \sum_{m=0}^{\infty} H_m(x) \frac{t^m}{m!} . \tag{3.35} $$

It holds

$$ f(x, t) = \exp\{ 2xt - t^2 \} . $$

It can be proved by checking the formula

$$ \frac{\partial^m}{\partial t^m} f(x, t) = H_m(x - t) f(x, t) . \tag{3.36} $$


Exercise 3.3.18. Check the formula (3.36) and derive (3.35).


3.3.7.4 Completeness 


This property means that the system of the normalized Hermite polynomials forms
an orthonormal basis in the $L_2$ Hilbert space of functions $f(x)$ on the real line satisfying

$$ \int_{-\infty}^{\infty} f^2(x) w(x) \, dx < \infty . $$


3.3.8 Trigonometric Series Expansion 


The trigonometric functions are frequently used in approximation theory, in
particular due to their relation to spectral theory. One usually applies either the
Fourier basis or the cosine basis.

The Fourier basis is composed of the constant function $F_0 \equiv 1$ and the functions
$F_{2m-1}(x) = \sin(2m\pi x)$ and $F_{2m}(x) = \cos(2m\pi x)$ for $m = 1, 2, \ldots$. These
functions are considered on the interval $[0, 1]$ and are all periodic: $f(0) = f(1)$.
Therefore, this basis can only be used for approximation of periodic functions.

The cosine basis is composed of the functions $S_0 \equiv 1$ and $S_m(x) = \cos(m\pi x)$
for $m \ge 1$. These functions are periodic for even $m$ and antiperiodic for odd $m$; this
allows one to approximate functions which are not necessarily periodic.


3.3.8.1 Orthogonality 


Trigonometric identities imply orthogonality

$$ \langle F_m, F_j \rangle = \int_0^1 F_m(x) F_j(x) \, dx = 0, \qquad j \ne m. \tag{3.37} $$

Also, for $m \ge 1$,

$$ \int_0^1 F_m^2(x) \, dx = 1/2 . \tag{3.38} $$


Exercise 3.3.19. Check (3.37) and (3.38).

Exercise 3.3.20. Check that for $m \ge 1$ and $j \ge 0$

$$ \int_0^1 S_j(x) S_m(x) \, dx = \frac{1}{2} \, \mathbf{1}(j = m) . $$


Many nice features of the Chebyshev polynomials can be translated to the cosine
basis by a simple change of variable: with $u = \cos(\pi x)$, it holds $S_m(x) = T_m(u)$.
So, any expansion of the function $f(u)$ by the Chebyshev polynomials yields an
expansion of $f(\cos(\pi x))$ by the cosine system.
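
The relation $S_m(x) = T_m(\cos(\pi x))$ and the orthogonality of the cosine system can be
checked numerically. The sketch below is an illustration only: the names `cosine_basis`
and `gram` are hypothetical, and the integral over $[0, 1]$ is approximated by a midpoint
rule with $n$ cells, which integrates products of these cosines exactly up to rounding
as long as $j + m < 2n$.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def cosine_basis(m, x):
    """S_0(x) = 1 and S_m(x) = cos(m pi x) for m >= 1."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if m == 0 else np.cos(m * np.pi * x)

def gram(j, m, n=1000):
    """Midpoint-rule approximation of int_0^1 S_j(x) S_m(x) dx."""
    x_mid = (np.arange(n) + 0.5) / n
    return float(np.mean(cosine_basis(j, x_mid) * cosine_basis(m, x_mid)))

# Change of variable u = cos(pi x): S_4(x) should equal T_4(u).
x = np.linspace(0.0, 1.0, 201)
coef = np.zeros(5)
coef[4] = 1.0                                   # selects the Chebyshev polynomial T_4
T4_of_u = chebval(np.cos(np.pi * x), coef)
print(np.max(np.abs(cosine_basis(4, x) - T4_of_u)))
print(gram(2, 3), gram(3, 3))
```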


3.4 Piecewise Methods and Splines 
This section discusses piecewise polynomial methods of approximation of the 
univariate regression functions. 


3.4.1 Piecewise Constant Estimation 


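Any continuous function can be locally approximated by a constant. This naturally
leads to the basis consisting of piecewise constant functions. Let $A_1, \ldots, A_K$ be a
non-overlapping partition of the design space $\mathcal{X}$:

$$ \mathcal{X} = \bigcup_{k=1}^{K} A_k, \qquad A_k \cap A_{k'} = \emptyset, \quad k \ne k'. \tag{3.39} $$

We approximate the function $f$ by a finite sum

$$ f(x) \approx f(x, \boldsymbol{\theta}) = \sum_{k=1}^{K} \theta_k \mathbf{1}(x \in A_k). \tag{3.40} $$

Here $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)^\top$ with $p = K$. A nice feature of this approximation is
that the basis indicator functions $\psi_1, \ldots, \psi_K$ are orthogonal because they have
non-overlapping supports. For the case of independent errors, this makes the
computation of the qMLE $\tilde{\boldsymbol{\theta}}$ very simple. In fact, every coefficient $\tilde{\theta}_k$ can be
estimated independently of the others. Indeed, the general formula (3.10) yields

$$ \tilde{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}} L(\boldsymbol{\theta})
 = \operatorname*{argmax}_{\boldsymbol{\theta}} \sum_{i=1}^{n} \ell\bigl( Y_i - f(X_i, \boldsymbol{\theta}) \bigr)
 = \operatorname*{argmax}_{\boldsymbol{\theta} = (\theta_k)} \sum_{k=1}^{K} \sum_{X_i \in A_k} \ell(Y_i - \theta_k). \tag{3.41} $$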


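Exercise 3.4.1. Show that $\tilde{\theta}_k$ can be obtained by the constant approximation of the
data $Y_i$ for $X_i \in A_k$:

$$ \tilde{\theta}_k = \operatorname*{argmax}_{\theta_k} \sum_{X_i \in A_k} \ell(Y_i - \theta_k), \qquad k = 1, \ldots, K. \tag{3.42} $$


A similar formula can be obtained for the target $\boldsymbol{\theta}^* = (\theta_k^*) = \operatorname{argmax}_{\boldsymbol{\theta}} E L(\boldsymbol{\theta})$:

$$ \theta_k^* = \operatorname*{argmax}_{\theta_k} \sum_{X_i \in A_k} E \ell(Y_i - \theta_k), \qquad k = 1, \ldots, K. $$


The estimator $\tilde{\boldsymbol{\theta}}$ can be computed explicitly in some special cases. In particular,
if $\rho$ corresponds to a density of a normal distribution, then the resulting estimator $\tilde{\theta}_k$ is
nothing but the mean of observations $Y_i$ over the piece $A_k$. For the Laplacian errors,
the solution is the median of the observations over $A_k$. First we consider the case of
Gaussian likelihood.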


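Theorem 3.4.1. Let $\ell(y) = -y^2/(2\sigma^2) + R$ be a log-density of a normal law. Then
for every $k = 1, \ldots, K$

$$ \tilde{\theta}_k = \frac{1}{N_k} \sum_{X_i \in A_k} Y_i , \qquad
   \theta_k^* = \frac{1}{N_k} \sum_{X_i \in A_k} E Y_i , $$

where $N_k$ stands for the number of design points $X_i$ within the piece $A_k$:

$$ N_k \stackrel{\mathrm{def}}{=} \sum_{X_i \in A_k} 1 = \#\{ i : X_i \in A_k \} . $$

Exercise 3.4.2. Check the statements of Theorem 3.4.1.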
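
The closed forms of this section can be sketched numerically as follows. This is a
minimal illustration, assuming pieces of the form $A_k = [e_{k-1}, e_k)$ described by an array
`edges`; the function name `piecewise_constant_fit` and the toy data are hypothetical.
For the Gaussian log-density the estimate on each piece is the mean of the $Y_i$, and for
Laplacian errors it is the median.

```python
import numpy as np

def piecewise_constant_fit(X, Y, edges, loss="gaussian"):
    """Estimate theta_k on each piece A_k = [edges[k-1], edges[k])."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    K = len(edges) - 1
    # Index of the piece of every design point; the right end joins the last piece.
    idx = np.clip(np.searchsorted(edges, X, side="right") - 1, 0, K - 1)
    center = np.mean if loss == "gaussian" else np.median
    counts = np.bincount(idx, minlength=K)       # N_k
    theta = np.full(K, np.nan)                   # NaN marks an empty piece
    for k in range(K):
        if counts[k] > 0:
            theta[k] = center(Y[idx == k])
    return theta, counts

X = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
Y = np.array([1.0, 2.0, 6.0, 4.0, 6.0, 8.0])
edges = np.array([0.0, 0.5, 1.0])
theta_mean, N = piecewise_constant_fit(X, Y, edges, loss="gaussian")
theta_median, _ = piecewise_constant_fit(X, Y, edges, loss="laplace")
print(theta_mean, theta_median, N)
```

Each piece is handled separately, which reflects the decomposition (3.41) of the
global optimization problem.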


The properties of each estimator $\tilde{\theta}_k$ repeat those of the MLE for the sample
restricted to $A_k$; see Sect. 2.9.1.


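Theorem 3.4.2. Let $\tilde{\boldsymbol{\theta}}$ be defined by (3.41) for a normal density $\rho(y)$. Then with
$\boldsymbol{\theta}^* = (\theta_1^*, \ldots, \theta_K^*)^\top = \operatorname{argmax}_{\boldsymbol{\theta}} E L(\boldsymbol{\theta})$, it holds

$$ E \tilde{\theta}_k = \theta_k^*, \qquad
   \operatorname{Var}(\tilde{\theta}_k) = \frac{1}{N_k^2} \sum_{X_i \in A_k} \operatorname{Var}(Y_i) . $$

Moreover,

$$ L(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^*) = \sum_{k=1}^{K} \frac{N_k}{2\sigma^2} (\tilde{\theta}_k - \theta_k^*)^2 . $$


The statements follow by direct calculus on each interval separately.
If the errors $\varepsilon_i = Y_i - E Y_i$ are normal and homogeneous, then the distribution
of the maximum likelihood $L(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^*)$ is available.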


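Theorem 3.4.3. Consider a Gaussian regression $Y_i \sim N(f(X_i), \sigma^2)$ for $i = 1, \ldots, n$.
Then $\tilde{\theta}_k \sim N(\theta_k^*, \sigma^2 / N_k)$ and

$$ 2 L(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^*) = \sum_{k=1}^{K} \frac{N_k}{\sigma^2} (\tilde{\theta}_k - \theta_k^*)^2 \sim \chi_K^2 , $$

where $\chi_K^2$ stands for the chi-squared distribution with $K$ degrees of freedom.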


This result is again a combination of the results from Sect. 2.9.1 for different
pieces $A_k$. It is worth mentioning once again that the regression function $f(\cdot)$ is not
assumed to be piecewise constant; it can be an arbitrary function. Each $\tilde{\theta}_k$ estimates
the mean $\theta_k^*$ of $f(\cdot)$ over the design points $X_i$ within $A_k$.

The results on the behavior of the maximum likelihood $L(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^*)$ are often used
for studying the properties of the chi-squared test; see Sect. 7.1 for more details.

A choice of the partition is an important issue in the piecewise constant
approximation. The presented results indicate that the variance of the estimator
$\tilde{\theta}_k$ of $\theta_k^*$ is inversely proportional to the number of points $N_k$ within the piece
$A_k$. In the univariate case one usually applies the equidistant partition: the design
interval is split into $p$ equal intervals $A_k$ leading to approximately equal values
$N_k$. Sometimes, especially if the design is irregular, a nonuniform partition can be
preferable. In general it can be recommended to split the whole design space into
intervals with approximately the same number $N_k$ of design points $X_i$.

A constant approximation is often not accurate enough to expand a regular 
regression function. One often uses a linear or polynomial approximation. The next 
sections explain this approach for the case of a univariate regression. 


3.4.2 Piecewise Linear Univariate Estimation 


The piecewise constant approximation can be naturally extended to piecewise
linear and piecewise polynomial construction. The starting point is again a
non-overlapping partition of $\mathcal{X}$ into intervals $A_k$ for $k = 1, \ldots, K$. First we explain
the idea for the linear approximation of the function $f$ on each interval $A_k$. Any
linear function on $A_k$ can be represented in the form $a_k + c_k x$ with some coefficients
$a_k, c_k$. This yields in total $p = 2K$ coefficients: $\boldsymbol{\theta} = (a_1, c_1, \ldots, a_K, c_K)^\top$. The
corresponding function $f(\cdot, \boldsymbol{\theta})$ can be represented as

$$ f(x) \approx f(x, \boldsymbol{\theta}) = \sum_{k=1}^{K} (a_k + c_k x) \mathbf{1}(x \in A_k). $$

The non-overlapping structure of the sets $A_k$ yields orthogonality of basis functions
for different pieces. As a corollary, one can optimize the linear approximation on
every interval $A_k$ independently of the others.


Exercise 3.4.3. Show that $\tilde{a}_k, \tilde{c}_k$ can be obtained by the linear approximation of the
data $Y_i$ for $X_i \in A_k$:

$$ (\tilde{a}_k, \tilde{c}_k) = \operatorname*{argmax}_{(a_k, c_k)} \sum_{i} \ell(Y_i - a_k - c_k X_i) \mathbf{1}(X_i \in A_k), \qquad k = 1, \ldots, K. $$


On every piece $A_k$, the constant and the linear function $x$ are not orthogonal
except in some very special situations. However, one can easily achieve orthogonality
by a shift of the linear term.


Exercise 3.4.4. For each $k \le K$, there exists a point $x_k$ such that

$$ \sum_{i} (X_i - x_k) \mathbf{1}(X_i \in A_k) = 0. \tag{3.43} $$


Introduce for each $k \le K$ two basis functions $\phi_{j-1}(x) = \mathbf{1}(x \in A_k)$ and
$\phi_j(x) = (x - x_k) \mathbf{1}(x \in A_k)$ with $j = 2k$.


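Exercise 3.4.5. Assume (3.43) for each $k \le K$. Check that any piecewise linear
function can be uniquely represented in the form

$$ f(x) = \sum_{j=1}^{p} \theta_j \phi_j(x) $$

with $p = 2K$ and the functions $\phi_j$ are orthogonal in the sense that for $j \ne j'$

$$ \sum_{i=1}^{n} \phi_j(X_i) \phi_{j'}(X_i) = 0. $$

In addition, for each $k \le K$

$$ \| \phi_j \|^2 \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n} \phi_j^2(X_i) =
   \begin{cases} N_k, & j = 2k - 1, \\ V_k^2, & j = 2k, \end{cases}
   \qquad
   N_k \stackrel{\mathrm{def}}{=} \sum_{X_i \in A_k} 1, \quad
   V_k^2 \stackrel{\mathrm{def}}{=} \sum_{X_i \in A_k} (X_i - x_k)^2 . $$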


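In the case of Gaussian regression, orthogonality of the basis helps to gain a
simple closed form for the estimators $\tilde{\boldsymbol{\theta}} = (\tilde{\theta}_j)$:

$$ \tilde{\theta}_j = \frac{1}{\| \phi_j \|^2} \sum_{i=1}^{n} Y_i \phi_j(X_i) =
   \begin{cases}
     \dfrac{1}{N_k} \sum_{X_i \in A_k} Y_i , & j = 2k - 1, \\[2ex]
     \dfrac{1}{V_k^2} \sum_{X_i \in A_k} Y_i (X_i - x_k), & j = 2k;
   \end{cases} $$

see Sect. 4.2 in the next chapter for a comprehensive study.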
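
The closed form above translates directly into code. The following sketch is an
illustration under the assumption that the pieces are intervals $[e_{k-1}, e_k)$ given by an
array `edges`; the name `piecewise_linear_fit` and the toy data are hypothetical. The
centre $x_k$ from (3.43) is the mean of the design points in $A_k$.

```python
import numpy as np

def piecewise_linear_fit(X, Y, edges):
    """Per piece: level (coefficient of 1(x in A_k)), slope and centre x_k."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    K = len(edges) - 1
    idx = np.clip(np.searchsorted(edges, X, side="right") - 1, 0, K - 1)
    level = np.full(K, np.nan)
    slope = np.full(K, np.nan)
    centre = np.full(K, np.nan)
    for k in range(K):
        Xk, Yk = X[idx == k], Y[idx == k]
        if Xk.size == 0:
            continue                              # empty piece: leave NaN
        centre[k] = Xk.mean()                     # x_k solving (3.43)
        level[k] = Yk.mean()                      # sum(Y_i) / N_k
        Vk2 = np.sum((Xk - centre[k]) ** 2)       # V_k^2
        if Vk2 > 0:
            slope[k] = np.sum(Yk * (Xk - centre[k])) / Vk2
    return level, slope, centre

# Noise-free toy data: 1 + 2x on [0, 0.5) and 5 - 3x on [0.5, 1].
X = np.array([0.0, 0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9])
Y = np.where(X < 0.5, 1.0 + 2.0 * X, 5.0 - 3.0 * X)
level, slope, centre = piecewise_linear_fit(X, Y, np.array([0.0, 0.5, 1.0]))
print(level, slope, centre)
```

The fitted function on the piece $A_k$ is `level[k] + slope[k] * (x - centre[k])`.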


3.4.3 Piecewise Polynomial Estimation 


Local linear expansion of the function $f(x)$ on each piece $A_k$ can be extended to a
piecewise polynomial case. The basic idea is to apply a polynomial approximation
of a certain degree $q$ on each piece $A_k$ independently. One can use for each piece
a basis of the form $(x - x_k)^m \mathbf{1}(x \in A_k)$ for $m = 0, 1, \ldots, q$ with $x_k$ from (3.43)
yielding the approximation

$$ f(x) \approx \sum_{k=1}^{K} f(x, \boldsymbol{a}_k) \mathbf{1}(x \in A_k)
 = \sum_{k=1}^{K} \bigl\{ a_{0,k} + a_{1,k} (x - x_k) + \ldots + a_{q,k} (x - x_k)^q \bigr\} \mathbf{1}(x \in A_k) $$

for $\boldsymbol{a}_k = (a_{0,k}, a_{1,k}, \ldots, a_{q,k})^\top$. This involves $q + 1$ parameters for each piece and
$p = K(q + 1)$ parameters in total. A nice feature of the piecewise approach is
that the coefficients $\boldsymbol{a}_k$ of the piecewise polynomial approximation can be estimated
independently for each piece. Namely,

$$ \tilde{\boldsymbol{a}}_k = \operatorname*{argmax}_{\boldsymbol{a}} \sum_{X_i \in A_k} \ell\bigl( Y_i - f(X_i, \boldsymbol{a}) \bigr) . $$


The properties of this estimator will be discussed in detail in Chap. 4.


3.4.4 Spline Estimation 


The main drawback of the piecewise polynomial approximation is that the resulting
function $f$ is discontinuous at the edge points between different pieces. A natural
way of improving the boundary effect is to force some conditions on the boundary
behavior. One important special case is given by the spline system. Let $\mathcal{X}$ be an
interval on the real line, perhaps infinite. Let also $t_0 < t_1 < \ldots < t_K$ be some
ordered points in $\mathcal{X}$ such that $t_0$ is the left edge and $t_K$ the right edge of $\mathcal{X}$. Such
points are called knots. We say that a function $f$ is a spline of degree $q$ at knots
$(t_k)$ if it is polynomial on each span $(t_{k-1}, t_k)$ for $k = 1, \ldots, K$ and satisfies the
boundary conditions

$$ f^{(m)}(t_k-) = f^{(m)}(t_k), \qquad m = 0, \ldots, q - 1, \quad k = 1, \ldots, K - 1. $$

Here $f^{(m)}(t-)$ stands for the left derivative of $f$ at $t$. In other words, the function $f$
and its first $q - 1$ derivatives are continuous on $\mathcal{X}$ and only the $q$th derivative may
have discontinuities at the knots $t_k$. It is obvious that the $q$th derivative $f^{(q)}(t)$ of the
spline of degree $q$ is a piecewise constant function on the spans $A_k = [t_{k-1}, t_k)$.

The spline is called uniform if the knots are equidistant, or, in other words, if all
the spans $A_k$ have equal length. Otherwise it is nonuniform.


Lemma 3.4.1. The set of all splines of degree $q$ at knots $(t_k)$ is a linear space, that
is, any linear combination of such splines is again a spline. Any function having a
continuous $m$th derivative for $m < q$ and piecewise constant $q$th derivative is a
$q$-spline.


Splines of degree zero are just piecewise constant functions studied in Sect. 3.4.1.
Linear splines are particularly transparent: this is the set of all piecewise linear
continuous functions on $\mathcal{X}$. Each of them can be easily constructed from left to right
or from right to left: start with a linear function $a_1 + c_1 x$ on the piece $A_1 = [t_0, t_1]$.
Then $f(t_1) = a_1 + c_1 t_1$. On the piece $A_2$ the slope of $f$ can be changed to $c_2$
leading to the function $f(x) = f(t_1) + c_2 (x - t_1)$ for $x \in [t_1, t_2]$. Similarly, at
$t_2$ the slope of $f$ can change to $c_3$ yielding $f(x) = f(t_2) + c_3 (x - t_2)$ on $A_3$,
and so on. Splines of higher order can be constructed similarly step by step: one
fixes the polynomial form on the very first piece $A_1$ and then continues the spline
function to every next piece $A_k$ using the boundary conditions and the value of the
$q$th derivative of $f$ on $A_k$. This construction explains the next result.


Lemma 3.4.2. Each spline $f$ of degree $q$ and knots $(t_k)$ is uniquely described by
the vector of coefficients $\boldsymbol{a}_1$ on the first span and the values $f^{(q)}(x)$ for each span
$A_1, \ldots, A_K$.


This result explains that the parameter dimension of the linear space of $q$-splines is
$q + K$: there are $q + 1$ coefficients on the first span and one value of $f^{(q)}$ for each of the
$K - 1$ remaining spans. One possible basis in this space is given by polynomials $x^m$ of degree
$m = 0, 1, \ldots, q$ and the functions $\phi_k(x) \stackrel{\mathrm{def}}{=} (x - t_k)_+^q$ for $k = 1, \ldots, K - 1$.


Exercise 3.4.6. Check that these functions, denoted $\phi_j(x)$ for $j = 1, \ldots, q + K$, form
a basis in the linear space of $q$-splines, and any $q$-spline $f$ can be represented as

$$ f(x) = \sum_{m=0}^{q} a_m x^m + \sum_{k=1}^{K-1} \theta_k \phi_k(x). \tag{3.44} $$

Hint: check that the functions $\phi_j(x)$ are linearly independent and that each $q$th
derivative $\phi_j^{(q)}(x)$ is piecewise constant.
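
The representation (3.44) can be turned into a least squares fit by building the
design matrix of the truncated power basis. The sketch below is illustrative only:
the function name `truncated_power_design` and the example data are our own, and
the degree is assumed to satisfy $q \ge 1$.

```python
import numpy as np

def truncated_power_design(x, knots, q):
    """Columns 1, x, ..., x^q followed by (x - t_k)_+^q for interior knots."""
    if q < 1:
        raise ValueError("this sketch assumes q >= 1")
    x = np.asarray(x, dtype=float)
    interior = np.asarray(knots, dtype=float)[1:-1]       # t_1, ..., t_{K-1}
    poly = np.vander(x, q + 1, increasing=True)
    trunc = np.clip(x[:, None] - interior[None, :], 0.0, None) ** q
    return np.hstack([poly, trunc])

# Linear spline (q = 1) with K = 2 spans: dimension q + K = 3.
knots = np.array([0.0, 0.5, 1.0])
x = np.linspace(0.0, 1.0, 21)
y = np.abs(x - 0.5)                  # equals 0.5 - x + 2 (x - 0.5)_+
D = truncated_power_design(x, knots, q=1)
coef = np.linalg.lstsq(D, y, rcond=None)[0]
print(D.shape, coef)
```

For many knots the columns of this matrix become strongly correlated, so in
practice one would monitor its condition number, e.g. with `np.linalg.cond(D)`.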


3.4.4.1 B-Splines 


Unfortunately, the basis functions $\{\phi_j(x)\}$ with $\phi_k(x) = (x - t_k)_+^q$ are only useful
for theoretical study. The main problem is that the functions $\phi_j(x)$ are strongly
correlated, and recovering the coefficients $\theta_j$ in the expansion (3.44) is a hard
numerical task. For this reason, one often uses another basis called B-splines. The
idea is to build splines of the given degree with the minimal support. Each B-spline
basis function $b_{k,q}(x)$ is only nonzero on the $q + 1$ neighbor spans $A_k, A_{k+1}, \ldots, A_{k+q}$
for $k = 1, \ldots, K - q$.


Exercise 3.4.7. Let $f(x)$ be a $q$-spline with the support on $q' \le q$ neighbor spans
$A_k, A_{k+1}, \ldots, A_{k+q'-1}$. Then $f(x) \equiv 0$.

Hint: consider any spline of the form $f(x) = \sum_j c_j \phi_j(x)$ with the sum over the knots
of the support. Show that the boundary conditions $f^{(m)}(t) = 0$ for $m = 0, 1, \ldots, q$ at the
right edge $t$ of the support yield $c_j = 0$.


The basis B-spline functions can be constructed successively. For $q = 0$, the
B-splines $b_{k,0}(x)$ coincide with the functions $\phi_k(x) = \mathbf{1}(x \in A_k)$, $k = 1, \ldots, K$.
Each linear B-spline $b_{k,1}(x)$ has a triangle shape on the two connected intervals $A_k$
and $A_{k+1}$. It can be defined by the formula

$$ b_{k,1}(x) \stackrel{\mathrm{def}}{=} \frac{x - t_{k-1}}{t_k - t_{k-1}} \, b_{k,0}(x)
   + \frac{t_{k+1} - x}{t_{k+1} - t_k} \, b_{k+1,0}(x), \qquad k = 1, \ldots, K - 1. $$

One can continue this way leading to the Cox-de Boor recursion formula

$$ b_{k,m}(x) \stackrel{\mathrm{def}}{=} \frac{x - t_{k-1}}{t_{k+m-1} - t_{k-1}} \, b_{k,m-1}(x)
   + \frac{t_{k+m} - x}{t_{k+m} - t_k} \, b_{k+1,m-1}(x) $$

for $k = 1, \ldots, K - m$.


Exercise 3.4.8. Check by induction for each function $b_{k,m}(x)$ the following conditions:

1. $b_{k,m}(x)$ is a polynomial of degree $m$ on each span $A_k, \ldots, A_{k+m}$ and zero
   outside;
2. $b_{k,m}(x)$ can be uniquely represented as a sum $b_{k,m}(x) = \sum_{l=0}^{m+1} c_{l,k} (x - t_{k+l-1})_+^m$;
3. $b_{k,m}(x)$ is an $m$-spline.

The formulas simplify for the uniform splines with equal span length $\Delta = |A_k|$:

$$ b_{k,m}(x) \stackrel{\mathrm{def}}{=} \frac{x - t_{k-1}}{m \Delta} \, b_{k,m-1}(x)
   + \frac{t_{k+m} - x}{m \Delta} \, b_{k+1,m-1}(x) $$

for $k = 1, \ldots, K - m$.
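
The Cox-de Boor recursion can be implemented almost verbatim. The following
recursive sketch (the function name `bspline` is our own) uses the indexing of this
section, with knots `t[0] < ... < t[K]` and spans $A_k = [t_{k-1}, t_k)$; it favors clarity over
speed and requires $k + m \le K$.

```python
import numpy as np

def bspline(k, m, x, t):
    """Evaluate b_{k,m}(x) for knots t[0] < ... < t[K]; needs k + m <= K."""
    x = np.asarray(x, dtype=float)
    if m == 0:
        # b_{k,0} is the indicator of the span A_k = [t[k-1], t[k])
        return ((t[k - 1] <= x) & (x < t[k])).astype(float)
    left = (x - t[k - 1]) / (t[k + m - 1] - t[k - 1]) * bspline(k, m - 1, x, t)
    right = (t[k + m] - x) / (t[k + m] - t[k]) * bspline(k + 1, m - 1, x, t)
    return left + right

t = np.arange(7.0)                       # uniform knots 0, 1, ..., 6, so K = 6
xs = np.array([2.0, 2.5, 3.3, 3.9])
total = sum(bspline(k, 2, xs, t) for k in range(1, 5))
print(bspline(1, 2, np.array([1.5]), t)) # peak of the quadratic B-spline on [0, 3)
print(total)                             # quadratic B-splines sum to one on [2, 4)
```

In the example, the quadratic B-splines $b_{1,2}, \ldots, b_{4,2}$ add up to one on the part of
the interval that is covered by three of them.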
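
Exercise 3.4.9. Check that for the uniform splines

$$ b_{k,m}(x) = \sum_{l=0}^{m+1} \omega_{l,m} \, (x - t_{k+l-1})_+^m $$

with

$$ \omega_{l,m} \stackrel{\mathrm{def}}{=} \frac{(-1)^l (m + 1)}{\Delta^m \, l! \, (m + 1 - l)!} . $$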

3.4.4.2 Smoothing Splines 


Such a spline system naturally arises as a solution of a penalized maximum
likelihood problem. Suppose we are given the regression data $(Y_i, X_i)$ with the
univariate design $X_1 < X_2 < \ldots < X_n$. Consider the mean regression model
$Y_i = f(X_i) + \varepsilon_i$ with zero mean errors $\varepsilon_i$. The assumption of independent
homogeneous Gaussian errors leads to the Gaussian log-likelihood

$$ L(f) = - \sum_{i=1}^{n} |Y_i - f(X_i)|^2 / (2\sigma^2). \tag{3.45} $$

Maximization of this expression w.r.t. all possible functions $f$ or, equivalently, all
vectors $(f(X_1), \ldots, f(X_n))^\top$ results in the trivial solution: $f(X_i) = Y_i$. This
means that the full dimensional maximum likelihood perfectly reproduces the
original noisy data. Some additional assumptions are needed to enforce any desirable
feature of the reconstructed function. One popular example is given by smoothness
of the function $f$. Degree of smoothness (or, inversely, degree of roughness) can be
measured by the value

$$ R_q(f) \stackrel{\mathrm{def}}{=} \int_{\mathcal{X}} |f^{(q)}(x)|^2 \, dx. \tag{3.46} $$

One can try to optimize the fit (3.45) subject to the constraint on the amount of
roughness from (3.46). Equivalently, one can optimize the penalized log-likelihood

$$ L_\lambda(f) \stackrel{\mathrm{def}}{=} L(f) - \lambda R_q(f)
 = - \frac{1}{2\sigma^2} \sum_{i=1}^{n} |Y_i - f(X_i)|^2 - \lambda \int_{\mathcal{X}} |f^{(q)}(x)|^2 \, dx , $$

where $\lambda > 0$ is a Lagrange multiplier. The corresponding maximizer is the penalized
maximum likelihood estimator:

$$ \tilde{f}_\lambda = \operatorname*{argmax}_{f} L_\lambda(f), \tag{3.47} $$

where the maximum is taken over the class of all measurable functions. It is
remarkable that the solution of this optimization problem is a spline of degree $q$
with the knots $X_1, \ldots, X_n$.


Theorem 3.4.4. For any $\lambda > 0$ and any integer $q$, the problem (3.47) has a unique
solution which is a $q$-spline with knots at design points $(X_i)$.


For the proof we refer to Green and Silverman (1994). Due to this result, one
can simplify the problem and look for a spline $f$ which maximizes the objective
$L_\lambda(f)$. A solution to (3.47) is called a smoothing spline. If $f$ is a $q$-spline, the
integral $R_q(f)$ can be easily computed. Indeed, $f^{(q)}(x)$ is piecewise constant, that
is, $f^{(q)}(x) = c_k$ for $x \in A_k$, and

$$ R_q(f) = \sum_{k=1}^{K} c_k^2 \, |t_k - t_{k-1}| . $$

For the uniform design, the formula simplifies even more, and by change of the
multiplier $\lambda$, one can use $R_q(f) = \sum_k c_k^2$. The use of any parametric representation
of a spline function $f$ allows one to represent the optimization problem (3.47) as a
penalized least squares problem. Estimation and inference in such problems are
studied below in Sect. 4.7.
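
To make the reduction to penalized least squares concrete, consider the simplest
case $q = 1$, for which the candidate function is piecewise linear with knots at the
design points and is parametrized by its values $f_i = f(X_i)$. Its slopes are
$c_k = (f_{k+1} - f_k)/h_k$ with $h_k = X_{k+1} - X_k$, so that $R_1(f) = \sum_k c_k^2 h_k$ is a quadratic form
in $(f_i)$. The sketch below is an illustration under these assumptions; the function name
`linear_smoothing_spline` and the data are hypothetical.

```python
import numpy as np

def linear_smoothing_spline(X, Y, lam, sigma2=1.0):
    """Values f(X_i) maximizing -|Y - f|^2 / (2 sigma2) - lam * R_1(f) for q = 1."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = len(X)
    h = np.diff(X)                              # span lengths between design points
    D = np.diff(np.eye(n), axis=0)              # (n-1) x n first-difference matrix
    P = D.T @ (D / h[:, None])                  # R_1(f) = f^T P f
    # Setting the gradient to zero gives (I + 2 sigma2 lam P) f = Y.
    return np.linalg.solve(np.eye(n) + 2.0 * sigma2 * lam * P, Y)

X = np.array([0.0, 1.0, 2.0, 4.0, 5.0])
Y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
f_zero = linear_smoothing_spline(X, Y, lam=0.0)     # no penalty: reproduces the data
f_big = linear_smoothing_spline(X, Y, lam=1e6)      # heavy penalty: nearly constant
print(f_zero, f_big)
```

The two extreme choices of $\lambda$ show the trade-off: without penalty the noisy data are
reproduced, while a very large penalty forces an almost constant fit at the level of
the mean of the $Y_i$.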


3.5 Generalized Regression 


Let the response $Y_i$ be observed at the design point $X_i \in \mathbb{R}^d$, $i = 1, \ldots, n$.
A (mean) regression model assumes that the observed values $Y_i$ are independent
and can be decomposed into the systematic component $f(X_i)$ and the individual
centered stochastic error $\varepsilon_i$. In some cases such a decomposition is questionable.
This especially concerns the case when the data $Y_i$ are categorical, e.g. binary
or discrete. Another striking example is given by nonnegative observations $Y_i$. In
such cases one usually assumes that the distribution of $Y_i$ belongs to some given
parametric family $(P_\upsilon, \upsilon \in \mathcal{U})$ and only the parameter of this distribution depends
on the design point $X_i$. We denote this parameter value as $f(X_i) \in \mathcal{U}$ and write the
model in the form

$$ Y_i \sim P_{f(X_i)} . $$

As previously, $f(\cdot)$ is called a regression function and its values at the design points
$X_i$ completely specify the joint data distribution:

$$ Y \sim \prod_{i} P_{f(X_i)} . $$

Below we assume that $(P_\upsilon)$ is a univariate exponential family with the log-density
$\ell(y, \upsilon)$.

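The parametric modeling approach assumes that the regression function $f$ can
be specified by a finite-dimensional parameter $\boldsymbol{\theta} \in \Theta \subset \mathbb{R}^p$: $f(x) = f(x, \boldsymbol{\theta})$. As
usual, by $\boldsymbol{\theta}^*$ we denote the true parameter value. The log-likelihood function for
this model reads

$$ L(\boldsymbol{\theta}) = \sum_{i} \ell\bigl( Y_i, f(X_i, \boldsymbol{\theta}) \bigr). $$

The corresponding MLE $\tilde{\boldsymbol{\theta}}$ maximizes $L(\boldsymbol{\theta})$:

$$ \tilde{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \sum_{i} \ell\bigl( Y_i, f(X_i, \boldsymbol{\theta}) \bigr). $$

The estimating equation $\nabla L(\boldsymbol{\theta}) = 0$ reads as

$$ \sum_{i} \ell'\bigl( Y_i, f(X_i, \boldsymbol{\theta}) \bigr) \nabla f(X_i, \boldsymbol{\theta}) = 0 $$

where $\ell'(y, \upsilon) \stackrel{\mathrm{def}}{=} \partial \ell(y, \upsilon) / \partial \upsilon$.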

The approach essentially depends on the parametrization of the considered EF.
Usually one applies either the natural or canonical parametrization. In the case of the
natural parametrization, $\ell(y, \upsilon) = C(\upsilon) y - B(\upsilon)$, where the functions $C(\cdot), B(\cdot)$
satisfy $B'(\upsilon) = \upsilon C'(\upsilon)$. This implies $\ell'(y, \upsilon) = y C'(\upsilon) - B'(\upsilon) = (y - \upsilon) C'(\upsilon)$
and the estimating equation reads as

$$ \sum_{i} \bigl( Y_i - f(X_i, \boldsymbol{\theta}) \bigr) C'\bigl( f(X_i, \boldsymbol{\theta}) \bigr) \nabla f(X_i, \boldsymbol{\theta}) = 0 . $$

Unfortunately, a closed form solution for this equation exists only in very special
cases. Even the questions of existence and uniqueness of the solution cannot be
studied in whole generality. Some numerical algorithms are usually applied to solve
the estimating equation.


Exercise 3.5.1. Specify the estimating equation for generalized EFn regression and
find the solution for the case of the constant regression function $f(X_i, \theta) \equiv \theta$.
Hint: If $f(X_i, \theta) \equiv \theta$, then the $Y_i$ are i.i.d. from $P_\theta$.


The equation can be slightly simplified by using the canonical parametrization.
If $(P_\upsilon)$ is an EFc with the log-density $\ell(y, \upsilon) = y\upsilon - d(\upsilon)$, then the log-likelihood
$L(\boldsymbol{\theta})$ can be represented in the form

$$ L(\boldsymbol{\theta}) = \sum_{i} \bigl\{ Y_i f(X_i, \boldsymbol{\theta}) - d\bigl( f(X_i, \boldsymbol{\theta}) \bigr) \bigr\} . $$

The corresponding estimating equation is

$$ \sum_{i} \bigl\{ Y_i - d'\bigl( f(X_i, \boldsymbol{\theta}) \bigr) \bigr\} \nabla f(X_i, \boldsymbol{\theta}) = 0 . $$


Exercise 3.5.2. Specify the estimating equation for generalized EFc regression and
find the solution for the case of constant regression with $f(X_i, \upsilon) \equiv \upsilon$. Relate the
natural and canonical representation.


A generalized regression with a canonical link is often applied in combination with 
linear modeling of the regression function considered in the next section. 


3.5.1 Generalized Linear Models 


Consider the generalized regression model

$$ Y_i \sim P_{f(X_i)} \in \mathcal{P} . $$

In addition we assume a linear (in parameters) structure of the regression function
$f(X)$. Such modeling is particularly useful to combine with the canonical
parametrization of the considered EF with the log-density $\ell(y, \upsilon) = y\upsilon - d(\upsilon)$.
The reason is that the stochastic part in the log-likelihood of an EFc linearly depends
on the parameter. So, below we assume that $\mathcal{P} = (P_\upsilon, \upsilon \in \mathcal{U})$ is an EFc.

Linear regression $f(X_i) = \Psi_i^\top \boldsymbol{\theta}$ with given feature vectors $\Psi_i \in \mathbb{R}^p$ leads to
the model with the log-likelihood

$$ L(\boldsymbol{\theta}) = \sum_{i} \bigl\{ Y_i \Psi_i^\top \boldsymbol{\theta} - d(\Psi_i^\top \boldsymbol{\theta}) \bigr\} . $$

Such a setup is called generalized linear model (GLM). Note that the log-likelihood
can be represented as

$$ L(\boldsymbol{\theta}) = S^\top \boldsymbol{\theta} - A(\boldsymbol{\theta}), $$

where

$$ S = \sum_{i} Y_i \Psi_i , \qquad A(\boldsymbol{\theta}) = \sum_{i} d(\Psi_i^\top \boldsymbol{\theta}) . $$

The corresponding MLE $\tilde{\boldsymbol{\theta}}$ maximizes $L(\boldsymbol{\theta})$. Again, a closed form solution only
exists in special cases. However, an important advantage of the GLM approach is
that the solution always exists and is unique. The reason is that the log-likelihood
function $L(\boldsymbol{\theta})$ is concave in $\boldsymbol{\theta}$.


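Lemma 3.5.1. The MLE $\tilde{\boldsymbol{\theta}}$ solves the following estimating equation:

$$ \nabla L(\boldsymbol{\theta}) = S - \nabla A(\boldsymbol{\theta}) = \sum_{i} \bigl( Y_i - d'(\Psi_i^\top \boldsymbol{\theta}) \bigr) \Psi_i = 0. \tag{3.48} $$

The solution exists and is unique.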


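Proof. Define the matrix

$$ B(\boldsymbol{\theta}) = \sum_{i} d''(\Psi_i^\top \boldsymbol{\theta}) \Psi_i \Psi_i^\top . \tag{3.49} $$

Since $d''(\upsilon)$ is strictly positive for all $\upsilon$, the matrix $B(\boldsymbol{\theta})$ is positive definite as
well. It holds

$$ \nabla^2 L(\boldsymbol{\theta}) = - \nabla^2 A(\boldsymbol{\theta}) = - \sum_{i} d''(\Psi_i^\top \boldsymbol{\theta}) \Psi_i \Psi_i^\top = - B(\boldsymbol{\theta}) . $$

Thus, the function $L(\boldsymbol{\theta})$ is strictly concave w.r.t. $\boldsymbol{\theta}$ and the estimating equation
$\nabla L(\boldsymbol{\theta}) = S - \nabla A(\boldsymbol{\theta}) = 0$ has the unique solution $\tilde{\boldsymbol{\theta}}$.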


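The solution of (3.48) can be easily obtained numerically by the Newton-Raphson
algorithm: select the initial estimate $\boldsymbol{\theta}^{(0)}$. Then for every $k \ge 0$ apply

$$ \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + \bigl\{ B(\boldsymbol{\theta}^{(k)}) \bigr\}^{-1} \bigl\{ S - \nabla A(\boldsymbol{\theta}^{(k)}) \bigr\} \tag{3.50} $$

until convergence.
Below we consider two special cases of GLMs for binary and Poissonian data.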
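
The iteration (3.50) can be sketched as follows for a generic EFc. This is a minimal
illustration rather than a production implementation: the function name `glm_newton`,
the stopping rule, and the toy data are our own choices, and the derivatives $d'$ and
$d''$ are passed as functions `d1` and `d2`. No step-size control is included, so
convergence is only to be expected for well-behaved data.

```python
import numpy as np

def glm_newton(Psi, Y, d1, d2, tol=1e-10, max_iter=100):
    """Newton-Raphson for L(theta) = sum_i {Y_i Psi_i^T theta - d(Psi_i^T theta)}."""
    Psi = np.asarray(Psi, dtype=float)           # rows are Psi_i^T
    Y = np.asarray(Y, dtype=float)
    theta = np.zeros(Psi.shape[1])               # initial estimate theta^(0) = 0
    S = Psi.T @ Y
    for _ in range(max_iter):
        eta = Psi @ theta
        grad = S - Psi.T @ d1(eta)               # S - grad A(theta), see (3.48)
        B = Psi.T @ (d2(eta)[:, None] * Psi)     # B(theta) from (3.49)
        step = np.linalg.solve(B, grad)
        theta = theta + step                     # update (3.50)
        if np.max(np.abs(step)) < tol:
            break
    return theta

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
Psi = np.column_stack([np.ones_like(x), x])

# Gaussian case d(v) = v^2 / 2: one Newton step gives ordinary least squares.
Y_gauss = np.array([0.5, 1.0, 2.5, 2.0, 4.0, 5.5])
theta_gauss = glm_newton(Psi, Y_gauss, lambda v: v, lambda v: np.ones_like(v))

# Bernoulli case d(v) = log(1 + e^v) with binary, non-separated responses.
Y_bin = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])
theta_logit = glm_newton(Psi, Y_bin, sigmoid, lambda v: sigmoid(v) * (1.0 - sigmoid(v)))
print(theta_gauss, theta_logit)
```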


3.5.2 Logit Regression for Binary Data 


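Suppose that the observed data $Y_i$ are independent and binary, that is, each $Y_i$ is
either zero or one, $i = 1, \ldots, n$. Such models are often used in, e.g., sociological
and medical studies, two-class classification, binary imaging, among many other
fields. We treat each $Y_i$ as a Bernoulli r.v. with the corresponding parameter
$f_i = f(X_i)$. This is a special case of generalized regression also called binary
response models. The parametric modeling assumption means that the regression
function $f(\cdot)$ can be represented in the form $f(X_i) = f(X_i, \boldsymbol{\theta})$ for a given class of
functions $\{ f(\cdot, \boldsymbol{\theta}), \boldsymbol{\theta} \in \Theta \subset \mathbb{R}^p \}$. Then the log-likelihood $L(\boldsymbol{\theta})$ reads as

$$ L(\boldsymbol{\theta}) = \sum_{i} \ell\bigl( Y_i, f(X_i, \boldsymbol{\theta}) \bigr), \tag{3.51} $$

where $\ell(y, \upsilon)$ is the log-density of the Bernoulli law. For linear modeling, it is more
useful to work with the canonical parametrization. Then $\ell(y, \upsilon) = y\upsilon - \log(1 + e^{\upsilon})$,
and the log-likelihood reads

$$ L(\boldsymbol{\theta}) = \sum_{i} \bigl[ Y_i f(X_i, \boldsymbol{\theta}) - \log\bigl( 1 + e^{f(X_i, \boldsymbol{\theta})} \bigr) \bigr] . $$

In particular, if the regression function $f(\cdot, \boldsymbol{\theta})$ is linear, that is, $f(X_i, \boldsymbol{\theta}) = \Psi_i^\top \boldsymbol{\theta}$,
then

$$ L(\boldsymbol{\theta}) = \sum_{i} \bigl[ Y_i \Psi_i^\top \boldsymbol{\theta} - \log\bigl( 1 + e^{\Psi_i^\top \boldsymbol{\theta}} \bigr) \bigr] . \tag{3.52} $$

The corresponding estimate reads as

$$ \tilde{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}} L(\boldsymbol{\theta})
 = \operatorname*{argmax}_{\boldsymbol{\theta}} \sum_{i} \bigl[ Y_i \Psi_i^\top \boldsymbol{\theta} - \log\bigl( 1 + e^{\Psi_i^\top \boldsymbol{\theta}} \bigr) \bigr] . $$

This modeling is usually referred to as logit regression.
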
Exercise 3.5.3. Specify the estimating equation for the case of logit regression. 


Exercise 3.5.4. Specify the step of the Newton—Raphson procedure for the case of 
logit regression. 


3.5.3 Parametric Poisson Regression 


Suppose that the observations $Y_i$ are nonnegative integer numbers. The Poisson
distribution is a natural candidate for modeling such data. It is supposed that the
underlying Poisson parameter depends on the regressor $X_i$. Typical examples arise
in different types of imaging including medical positron emission and magnetic
resonance tomography, satellite and low-luminosity imaging, queueing theory, high
frequency trading, etc. The regression equation reads

$$ Y_i \sim \operatorname{Poisson}\bigl( f(X_i) \bigr). $$

The Poisson regression function $f(X_i)$ is usually the target of estimation. The
parametric specification $f(\cdot) \in \{ f(\cdot, \boldsymbol{\theta}), \boldsymbol{\theta} \in \Theta \}$ reduces this problem to
estimating the parameter $\boldsymbol{\theta}$. Under the assumption of independent observations $Y_i$,
the corresponding log-likelihood $L(\boldsymbol{\theta})$ is given by

$$ L(\boldsymbol{\theta}) = \sum_{i} \bigl[ Y_i \log\{ f(X_i, \boldsymbol{\theta}) \} - f(X_i, \boldsymbol{\theta}) \bigr] + R, $$

where the remainder $R$ does not depend on $\boldsymbol{\theta}$ and can be omitted. Obviously, the
constant function family $f(\cdot, \theta) \equiv \theta$ leads back to the case of i.i.d. modeling studied
in Sect. 2.11. A further extension is given by linear Poisson regression: $f(X_i, \boldsymbol{\theta}) =
\Psi_i^\top \boldsymbol{\theta}$ for some given factors $\Psi_i$. The log-likelihood reads

$$ L(\boldsymbol{\theta}) = \sum_{i} \bigl[ Y_i \log(\Psi_i^\top \boldsymbol{\theta}) - \Psi_i^\top \boldsymbol{\theta} \bigr] . \tag{3.53} $$


L 


Exercise 3.5.5. Specify the estimating equation and the Newton—Raphson proce- 
dure for the linear Poisson regression (3.53). 


An obvious problem of linear Poisson modeling is that it requires all the values
$\Psi_i^\top \boldsymbol{\theta}$ to be positive. The use of canonical parametrization helps to avoid this
problem. The linear structure is assumed for the canonical parameter leading to the
representation $f(X_i) = \exp(\Psi_i^\top \boldsymbol{\theta})$. Then the general log-likelihood process $L(\boldsymbol{\theta})$
from (3.51) translates into

$$ L(\boldsymbol{\theta}) = \sum_{i} \bigl[ Y_i \Psi_i^\top \boldsymbol{\theta} - \exp(\Psi_i^\top \boldsymbol{\theta}) \bigr] ; \tag{3.54} $$

cf. with (3.52).


Exercise 3.5.6. Specify the estimating equation and the Newton—Raphson proce- 
dure for the canonical link linear Poisson regression (3.54). 


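If the factors $\Psi_i$ are properly scaled, then the scalar products $\Psi_i^\top \boldsymbol{\theta}$ for all $i$ and
all $\boldsymbol{\theta} \in \Theta$ belong to some bounded interval. For the matrix $B(\boldsymbol{\theta})$ from (3.49), it
holds

$$ B(\boldsymbol{\theta}) = \sum_{i} \exp(\Psi_i^\top \boldsymbol{\theta}) \Psi_i \Psi_i^\top . $$

Initializing the ML optimization problem with $\boldsymbol{\theta}^{(0)} = 0$ leads to the oLSE for the
shifted observations $Y_i - 1$, because $\nabla A(0) = \sum_i \Psi_i$:

$$ \boldsymbol{\theta}^{(1)} = \Bigl( \sum_{i} \Psi_i \Psi_i^\top \Bigr)^{-1} \sum_{i} \Psi_i (Y_i - 1) . $$

The further steps of the algorithm (3.50) can be done as weighted LSE with the
weights $\exp(\Psi_i^\top \tilde{\boldsymbol{\theta}})$ for the estimate $\tilde{\boldsymbol{\theta}}$ obtained at the previous step.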
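
The weighted least squares form of the Newton-Raphson steps can be sketched as
follows for the canonical link Poisson regression (3.54). The function name
`poisson_irls`, the "working response" `z`, and the toy counts are our own choices for
this illustration; a fixed number of iterations is used instead of a stopping rule.

```python
import numpy as np

def poisson_irls(Psi, Y, n_iter=50):
    """Newton steps (3.50) for (3.54), each written as a weighted least squares fit."""
    Psi = np.asarray(Psi, dtype=float)        # rows are Psi_i^T
    Y = np.asarray(Y, dtype=float)
    theta = np.zeros(Psi.shape[1])            # start at theta^(0) = 0
    for _ in range(n_iter):
        eta = Psi @ theta
        w = np.exp(eta)                       # weights exp(Psi_i^T theta)
        z = eta + (Y - w) / w                 # working response of the weighted LSE
        WPsi = w[:, None] * Psi
        # Solves B(theta) theta_new = Psi^T W z, which is the same as (3.50).
        theta = np.linalg.solve(Psi.T @ WPsi, WPsi.T @ z)
    return theta

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
Psi = np.column_stack([np.ones_like(x), x])
Y = np.array([1.0, 2.0, 2.0, 3.0, 4.0])

theta_first = poisson_irls(Psi, Y, n_iter=1)   # least squares fit of Y - 1
theta_hat = poisson_irls(Psi, Y)
score = Psi.T @ (Y - np.exp(Psi @ theta_hat))  # left-hand side of (3.48)
print(theta_first, theta_hat, score)
```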
3.5.4 Piecewise Constant Methods in Generalized Regression 


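Consider a generalized regression model

$$ Y_i \sim P_{f(X_i)} \in (P_\upsilon) $$

for a given exponential family $(P_\upsilon)$. Further, let $A_1, \ldots, A_K$ be a non-overlapping
partition of the design space $\mathcal{X}$; see (3.39). A piecewise constant approximation (3.40)
of the regression function $f(\cdot)$ leads to the additive log-likelihood
structure: for $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)^\top$

$$ \tilde{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}} L(\boldsymbol{\theta})
 = \operatorname*{argmax}_{\theta_1, \ldots, \theta_K} \sum_{k=1}^{K} \sum_{X_i \in A_k} \ell(Y_i, \theta_k) ; $$

cf. (3.41). Similarly to the mean regression case, the global optimization w.r.t. the
vector $\boldsymbol{\theta}$ can be decomposed into $K$ separated simple optimization problems:

$$ \tilde{\theta}_k = \operatorname*{argmax}_{\theta} \sum_{X_i \in A_k} \ell(Y_i, \theta) ; $$

cf. (3.42). The same decomposition can be obtained for the target
$\boldsymbol{\theta}^* = (\theta_1^*, \ldots, \theta_K^*)^\top$:

$$ \boldsymbol{\theta}^* = \operatorname*{argmax}_{\boldsymbol{\theta}} E L(\boldsymbol{\theta})
 = \operatorname*{argmax}_{\theta_1, \ldots, \theta_K} \sum_{k=1}^{K} \sum_{X_i \in A_k} E \ell(Y_i, \theta_k) . $$

The properties of each estimator $\tilde{\theta}_k$ repeat those of the qMLE for a univariate EFn;
see Sect. 2.11.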


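Theorem 3.5.1. Let $\ell(y, \theta) = C(\theta) y - B(\theta)$ be a log-density of an EFn, so that the
functions $B(\theta)$ and $C(\theta)$ satisfy $B'(\theta) = \theta C'(\theta)$. Then for every $k = 1, \ldots, K$

$$ \tilde{\theta}_k = \frac{1}{N_k} \sum_{X_i \in A_k} Y_i , \qquad
   \theta_k^* = \frac{1}{N_k} \sum_{X_i \in A_k} E Y_i , $$

where $N_k$ stands for the number of design points $X_i$ within the piece $A_k$:

$$ N_k \stackrel{\mathrm{def}}{=} \sum_{X_i \in A_k} 1 = \#\{ i : X_i \in A_k \} . $$

Moreover, it holds

$$ E \tilde{\theta}_k = \theta_k^* $$

and

$$ L(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^*) = \sum_{k=1}^{K} N_k \, \mathcal{K}(\tilde{\theta}_k, \theta_k^*) \tag{3.55} $$

where $\mathcal{K}(\theta, \theta') \stackrel{\mathrm{def}}{=} E_\theta \{ \ell(Y_i, \theta) - \ell(Y_i, \theta') \}$.


These statements follow from Theorem 3.5.1 and Theorem 2.11.1 of Sect. 2.11.
For the presented results, the true regression function $f(\cdot)$ can be of arbitrary
structure, the true distribution of each $Y_i$ can differ from $P_{f(X_i)}$.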


Exercise 3.5.7. Check the statements of Theorem 3.5.1. 


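If PA is correct, that is, if $f$ is indeed piecewise constant and the distribution of $Y_i$
is indeed $P_{f(X_i)}$, the deviation bound for the excess $L(\tilde{\theta}_k, \theta_k^*)$ from Theorem 2.11.4
can be applied to each piece $A_k$ yielding the following result.


Theorem 3.5.2. Let $(P_\theta)$ be an EFn and let $Y_i \sim P_{\theta_k^*}$ for $X_i \in A_k$ and
$k = 1, \ldots, K$. Then for any $\mathfrak{z} > 0$

$$ P\bigl( L(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^*) > K \mathfrak{z} \bigr) \le 2K e^{-\mathfrak{z}} . $$


Proof. By (3.55) and Theorem 2.11.4

$$ P\bigl( L(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^*) > K \mathfrak{z} \bigr)
 = P\Bigl( \sum_{k=1}^{K} N_k \, \mathcal{K}(\tilde{\theta}_k, \theta_k^*) > K \mathfrak{z} \Bigr)
 \le \sum_{k=1}^{K} P\bigl( N_k \, \mathcal{K}(\tilde{\theta}_k, \theta_k^*) > \mathfrak{z} \bigr)
 \le 2K e^{-\mathfrak{z}} $$

and the result follows.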


A piecewise linear generalized regression can be treated in a similar way. The
main benefit of piecewise modeling remains preserved: a global optimization over
the vector $\boldsymbol{\theta}$ can be decomposed into a set of small optimization problems for each
piece $A_k$. However, a closed form solution is available only in some special cases
like Gaussian regression.
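
As a small numerical check of the decomposition (3.55), consider the Poisson family
in its natural parametrization with $\ell(y, \theta) = y \log\theta - \theta$ up to a term free of $\theta$, for
which $\mathcal{K}(\theta, \theta') = \theta \log(\theta/\theta') - (\theta - \theta')$. The counts and the values `theta_star` in
the sketch below are hypothetical and serve only to illustrate the identity.

```python
import numpy as np

def poisson_loglik(Y, theta):
    """Poisson log-likelihood up to a term free of theta."""
    return float(np.sum(Y * np.log(theta) - theta))

def poisson_kl(theta, theta_prime):
    """K(theta, theta') = theta log(theta / theta') - (theta - theta')."""
    return theta * np.log(theta / theta_prime) - (theta - theta_prime)

# Hypothetical counts on K = 2 pieces and assumed target values theta_star.
Y_by_piece = [np.array([2.0, 4.0, 3.0]), np.array([7.0, 9.0])]
theta_star = np.array([2.5, 9.0])

theta_tilde = np.array([Y.mean() for Y in Y_by_piece])    # piecewise means
N = np.array([len(Y) for Y in Y_by_piece])                # N_k

# Excess computed directly from the log-likelihood and via (3.55).
excess_direct = sum(
    poisson_loglik(Y, tt) - poisson_loglik(Y, ts)
    for Y, tt, ts in zip(Y_by_piece, theta_tilde, theta_star)
)
excess_formula = float(np.sum(N * poisson_kl(theta_tilde, theta_star)))
print(theta_tilde, excess_direct, excess_formula)
```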


3.5.5 Smoothing Splines for Generalized Regression 


Consider again the generalized regression model

$$ Y_i \sim P_{f(X_i)} \in \mathcal{P} $$

for an exponential family $\mathcal{P}$ with canonical parametrization. Now we do not assume
any specific parametric structure for the function $f$. Instead, the function $f$ is
supposed to be smooth and its smoothness is measured by the roughness $R_q(f)$
from (3.46). Similarly to the regression case of Sect. 3.4.4, the function $f$ can be
estimated directly by optimizing the penalized log-likelihood $L_\lambda(f)$:

$$ \tilde{f}_\lambda = \operatorname*{argmax}_{f} L_\lambda(f)
 = \operatorname*{argmax}_{f} \bigl\{ L(f) - \lambda R_q(f) \bigr\}
 = \operatorname*{argmax}_{f} \Bigl[ \sum_{i} \bigl\{ Y_i f(X_i) - d\bigl( f(X_i) \bigr) \bigr\}
   - \lambda \int_{\mathcal{X}} |f^{(q)}(x)|^2 \, dx \Bigr] . \tag{3.56} $$

The maximum is taken over the class of all regular $q$-times differentiable functions.
In the regression case, the function $d(\cdot)$ is quadratic and the solution is a spline
function with knots $X_i$. This conclusion can be extended to the case of any
convex function $d(\cdot)$, thus, the problem (3.56) yields a smoothing spline solution.
Numerically this problem is usually solved by iterations. One starts with a quadratic
function $d(\upsilon) = \upsilon^2/2$ to obtain an initial approximation $f^{(0)}(\cdot)$ of $f(\cdot)$ by a
standard smoothing spline regression. Further, at each new step $k + 1$, the use
of the estimate $f^{(k)}(\cdot)$ from the previous step $k$ for $k \ge 0$ helps to approximate
the problem (3.56) by a weighted regression. The corresponding iterations can be
written in the form (3.50).


3.6 Historical Remarks and Further Reading 


A nice introduction to the use of smoothing splines in statistics can be found in
Green and Silverman (1994) and Wahba (1990). For further properties of the spline
approximation and algorithmic use of splines see de Boor (2001).

Orthogonal polynomials have a long history and have been applied in many
different fields of mathematics. We refer to Szegő (1939) and Chihara (2011) for
the classical results and history around different polynomial systems.

Some further methods in regression estimation and their features are described, 
e.g., in Lehmann and Casella (1998), Fan and Gijbels (1996), and Wasserman 
(2006). 


Chapter 4 
Estimation in Linear Models 


This chapter studies the estimation problem for a linear model. The first four 
sections are fairly classical and the presented results are based on the direct analysis 
of the linear estimation procedures. Sections 4.5 and 4.6 reproduce in a very short 
form the same results but now based on the likelihood analysis. The presentation 
is based on the celebrated chi-squared phenomenon which appears to be the 
fundamental fact yielding the exact likelihood-based concentration and confidence 
properties. The further sections are complementary and can be recommended for a 
more profound reading. The issues like regularization, shrinkage, smoothness, and 
roughness are usually studied within the nonparametric theory, here we try to fit 
them to the classical linear parametric setup. A special focus is on semiparametric 
estimation in Sect. 4.9. In particular, efficient estimation and chi-squared result are 
extended to the semiparametric framework. 

The main tool of the study is the quasi maximum likelihood method. We 
especially focus on the validity of the presented results under possible model 
misspecification. Another important issue is the way of measuring the estimation 
loss and risk. We distinguish below between response estimation or prediction 
and the parameter estimation. The most advanced results like chi-squared result 
in Sect. 4.6 are established under the assumption of a Gaussian noise. However, a 
misspecification of noise structure is allowed and addressed. 


4.1 Modeling Assumptions 


A linear model assumes that the observations $Y_i$ follow the equation:

$$Y_i = \Psi_i^\top \theta^* + \varepsilon_i \tag{4.1}$$

for $i = 1, \ldots, n$, where $\theta^* = (\theta_1^*, \ldots, \theta_p^*)^\top \in \mathbb{R}^p$ is an unknown parameter
vector, $\Psi_i$ are given vectors in $\mathbb{R}^p$, and the $\varepsilon_i$'s are individual errors with zero mean.
A typical example is given by linear regression (see Sect. 3.3) when the vectors $\Psi_i$
are the values of a set of functions (e.g., polynomial or trigonometric series) at the
design points $X_i$.

A linear Gaussian model assumes in addition that the vector of errors $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ is normally distributed with zero mean and a covariance matrix $\Sigma$:


$$\varepsilon \sim N(0, \Sigma).$$


In this chapter we suppose that $\Sigma$ is given in advance. We will distinguish between
three cases:


1. the errors $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$, or equivalently, the matrix $\Sigma$ is equal to $\sigma^2 I_n$
with $I_n$ being the unit matrix in $\mathbb{R}^n$.

2. the errors are independent but not homogeneous, that is, $\mathbb{E} \varepsilon_i^2 = \sigma_i^2$. Then the
matrix $\Sigma$ is diagonal: $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$.

3. the errors $\varepsilon_i$ are dependent with a covariance matrix $\Sigma$.


In practical applications one mostly starts with the white Gaussian noise assumption
and more general cases 2 and 3 are only considered if there are clear indications
of the noise inhomogeneity or correlation. The second situation is typical, e.g., for 
the eigenvector decomposition in an inverse problem. The last case is the most 
general and includes the first two. 


4.2 Quasi Maximum Likelihood Estimation 


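Denote by $Y = (Y_1, \ldots, Y_n)^\top$ (resp. $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$) the vector of observations
(resp. of errors) in $\mathbb{R}^n$ and by $\Psi$ the $p \times n$ matrix with columns $\Psi_i$. Let also $\Psi^\top$
denote its transpose. Then the model equation can be rewritten as:


$$Y = \Psi^\top \theta^* + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma).$$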


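An equivalent formulation is that $\Sigma^{-1/2}(Y - \Psi^\top \theta)$ is a standard normal vector in
$\mathbb{R}^n$. The log-density of the distribution of the vector $Y = (Y_1, \ldots, Y_n)^\top$ w.r.t. the
Lebesgue measure in $\mathbb{R}^n$ is therefore of the form


$$L(\theta) = -\frac{n}{2} \log(2\pi) - \frac{\log(\det \Sigma)}{2} - \frac{1}{2} \bigl\| \Sigma^{-1/2} (Y - \Psi^\top \theta) \bigr\|^2 = -\frac{n}{2} \log(2\pi) - \frac{\log(\det \Sigma)}{2} - \frac{1}{2} (Y - \Psi^\top \theta)^\top \Sigma^{-1} (Y - \Psi^\top \theta).$$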


In case 1 this expression can be rewritten as


$$L(\theta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \Psi_i^\top \theta)^2.$$


In case 2 the expression is similar:


$$L(\theta) = -\sum_{i=1}^{n} \Bigl\{ \frac{1}{2} \log(2\pi\sigma_i^2) + \frac{(Y_i - \Psi_i^\top \theta)^2}{2\sigma_i^2} \Bigr\}.$$

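The maximum likelihood estimate (MLE) $\tilde\theta$ of $\theta^*$ is defined by maximizing the
log-likelihood $L(\theta)$:


$$\tilde\theta = \operatorname*{argmax}_{\theta \in \mathbb{R}^p} L(\theta) = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} (Y - \Psi^\top \theta)^\top \Sigma^{-1} (Y - \Psi^\top \theta). \tag{4.2}$$


We omit the other terms in the expression of $L(\theta)$ because they do not depend on $\theta$.
This estimate is the least squares estimate (LSE) because it minimizes the sum of
squared distances between the observations $Y_i$ and the linear responses $\Psi_i^\top \theta$. Note
that (4.2) is a quadratic optimization problem which has a closed form solution.
Differentiating the right-hand side of (4.2) w.r.t. $\theta$ yields the normal equation


$$\Psi \Sigma^{-1} \Psi^\top \theta = \Psi \Sigma^{-1} Y.$$


If the $p \times p$-matrix $\Psi \Sigma^{-1} \Psi^\top$ is non-degenerate, then the normal equation has the
unique solution


$$\tilde\theta = (\Psi \Sigma^{-1} \Psi^\top)^{-1} \Psi \Sigma^{-1} Y = \mathcal{S} Y, \tag{4.3}$$

where

$$\mathcal{S} = (\Psi \Sigma^{-1} \Psi^\top)^{-1} \Psi \Sigma^{-1}$$

is a $p \times n$ matrix. We denote by $\tilde\theta_m$ the entries of the vector $\tilde\theta$, $m = 1, \ldots, p$.


If the matrix $\Psi \Sigma^{-1} \Psi^\top$ is degenerate, then the normal equation has infinitely
many solutions. However, one can still apply the formula (4.3) where $(\Psi \Sigma^{-1} \Psi^\top)^{-1}$
is a pseudo-inverse of the matrix $\Psi \Sigma^{-1} \Psi^\top$.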


The ML approach leads to the parameter estimate $\tilde\theta$. Note that due to the
model (4.1), the product $\tilde f := \Psi^\top \tilde\theta$ is an estimate of the mean $f^* := \mathbb{E} Y$ of
the vector of observations $Y$:

$$\tilde f = \Psi^\top \tilde\theta = \Psi^\top (\Psi \Sigma^{-1} \Psi^\top)^{-1} \Psi \Sigma^{-1} Y = \Pi Y,$$

where

$$\Pi = \Psi^\top (\Psi \Sigma^{-1} \Psi^\top)^{-1} \Psi \Sigma^{-1}$$

is an $n \times n$ matrix (linear operator) in $\mathbb{R}^n$. The vector $\tilde f$ is called a prediction or
response regression estimate.

Below we study the properties of the estimates $\tilde\theta$ and $\tilde f$. In this study we try to
address both types of possible model misspecification: due to a wrong assumption
about the error distribution and due to a possibly wrong linear parametric structure.
Namely we consider the model


$$Y_i = f_i + \varepsilon_i, \qquad \varepsilon \sim N(0, \Sigma_0). \tag{4.4}$$


The response values $f_i$ are usually treated as the value of the regression function
$f(\cdot)$ at the design points $X_i$. The parametric model (4.1) can be viewed as an
approximation of (4.4) while $\Sigma$ is an approximation of the true covariance matrix
$\Sigma_0$. If $f^*$ is indeed equal to $\Psi^\top \theta^*$ and $\Sigma = \Sigma_0$, then $\tilde\theta$ and $\tilde f$ are MLEs, otherwise
quasi MLEs. In our study we mostly restrict ourselves to the case 1 assumption
about the noise $\varepsilon$: $\varepsilon \sim N(0, \sigma^2 I_n)$. The general case can be reduced to this one by a
simple data transformation, namely, by multiplying the Eq. (4.4) $Y = f^* + \varepsilon$ with
the matrix $\Sigma^{-1/2}$, see Sect. 4.6 for more detail.
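The normal equation and the estimate (4.3) can be illustrated by a minimal Python sketch for $p = 2$ and a diagonal covariance matrix (case 2). The function names `solve_2x2` and `weighted_lse` and the toy data are assumptions made for this illustration; the data are noise-free, so the sketch should return the coefficients used to generate them whatever the variances are.

```python
# Minimal sketch of (4.3) for p = 2 and a diagonal covariance matrix
# Sigma = diag(sigma2). Names and data are illustrative assumptions.

def solve_2x2(a, b):
    """Solve the 2 x 2 linear system a t = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("degenerate normal equation")
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def weighted_lse(psi, y, sigma2):
    """psi: 2 x n matrix (list of two rows), sigma2: variances sigma_i^2.

    Returns theta solving Psi Sigma^{-1} Psi^T theta = Psi Sigma^{-1} Y
    and the response estimate Psi^T theta.
    """
    n = len(y)
    gram = [[sum(psi[m][i] * psi[k][i] / sigma2[i] for i in range(n))
             for k in range(2)] for m in range(2)]
    rhs = [sum(psi[m][i] * y[i] / sigma2[i] for i in range(n))
           for m in range(2)]
    theta = solve_2x2(gram, rhs)
    fitted = [theta[0] * psi[0][i] + theta[1] * psi[1][i] for i in range(n)]
    return theta, fitted


design = [0.0, 1.0, 2.0, 3.0]
psi = [[1.0] * 4, design]                 # basis functions 1 and x
y = [1.0 + 2.0 * x for x in design]       # noise-free responses
theta, fitted = weighted_lse(psi, y, [1.0, 4.0, 1.0, 4.0])
print(theta, fitted)
```

Setting all entries of `sigma2` to the same value gives the ordinary least squares estimate, in which the common variance cancels.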


4.2.1 Estimation Under the Homogeneous Noise Assumption 


If a homogeneous noise is assumed, that is $\Sigma = \sigma^2 I_n$ and $\varepsilon \sim N(0, \sigma^2 I_n)$, then the
formulae for the MLEs $\tilde\theta$, $\tilde f$ slightly simplify. In particular, the variance $\sigma^2$ cancels
and the resulting estimate is the ordinary least squares (oLSE):


$$\tilde\theta = (\Psi \Psi^\top)^{-1} \Psi Y = \mathcal{S} Y$$

with $\mathcal{S} = (\Psi \Psi^\top)^{-1} \Psi$. Also

$$\tilde f = \Psi^\top (\Psi \Psi^\top)^{-1} \Psi Y = \Pi Y$$

with $\Pi = \Psi^\top (\Psi \Psi^\top)^{-1} \Psi$.


Exercise 4.2.1. Derive the formulae for $\tilde\theta$, $\tilde f$ directly from the log-likelihood $L(\theta)$
for homogeneous noise.


If the assumption $\varepsilon \sim N(0, \sigma^2 I_n)$ about the errors is not precisely fulfilled, then
the oLSE can be viewed as a quasi MLE.


4.2.2 Linear Basis Transformation


Denote by $\psi_1^\top, \ldots, \psi_p^\top$ the rows of the matrix $\Psi$. Then the $\psi_m$'s are vectors in $\mathbb{R}^n$
and we call them the basis vectors. In the linear regression case the $\psi_m$'s are obtained
as the values of the basis functions at the design points. Our linear parametric
assumption simply means that the underlying vector $f^*$ can be represented as a
linear combination of the vectors $\psi_1, \ldots, \psi_p$:


$$f^* = \theta_1^* \psi_1 + \ldots + \theta_p^* \psi_p.$$


In other words, $f^*$ belongs to the linear subspace in $\mathbb{R}^n$ spanned by the vectors
$\psi_1, \ldots, \psi_p$. It is clear that this assumption still holds if we select another basis in
this subspace.

Let $U$ be any linear orthogonal transformation in $\mathbb{R}^p$ with $U U^\top = I_p$. Then the
linear relation $f^* = \Psi^\top \theta^*$ can be rewritten as


$$f^* = \Psi^\top U U^\top \theta^* = \breve\Psi^\top u^*$$

with $\breve\Psi = U^\top \Psi$ and $u^* = U^\top \theta^*$. Here the columns of $\breve\Psi^\top$ mean the new basis
vectors $\breve\psi_m$ in the same subspace while $u^*$ is the vector of coefficients describing
the decomposition of the vector $f^*$ w.r.t. this new basis:


$$f^* = u_1^* \breve\psi_1 + \ldots + u_p^* \breve\psi_p.$$

The natural question is how the expression for the MLEs $\tilde\theta$ and $\tilde f$ changes with
the change of the basis. The answer is straightforward. For notational simplicity, we
only consider the case with $\Sigma = \sigma^2 I_n$. The model can be rewritten as

$$Y = \breve\Psi^\top u^* + \varepsilon$$

yielding the solutions

$$\tilde u = (\breve\Psi \breve\Psi^\top)^{-1} \breve\Psi Y = \breve{\mathcal{S}} Y, \qquad \tilde f = \breve\Psi^\top (\breve\Psi \breve\Psi^\top)^{-1} \breve\Psi Y = \breve\Pi Y,$$


where $\breve\Psi = U^\top \Psi$ implies

$$\breve{\mathcal{S}} = (\breve\Psi \breve\Psi^\top)^{-1} \breve\Psi = U^\top \mathcal{S}, \qquad \breve\Pi = \breve\Psi^\top (\breve\Psi \breve\Psi^\top)^{-1} \breve\Psi = \Pi.$$

This yields

$$\tilde u = U^\top \tilde\theta$$

and moreover, the estimate $\tilde f$ is not changed for any linear transformation of
the basis. The first statement can be expected in view of $\theta^* = U u^*$, while the
second one will be explained in the next section: $\Pi$ is the linear projector on the
subspace spanned by the basis vectors and this projector is invariant w.r.t. basis
transformations.
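The invariance just described can be checked numerically. The following Python sketch, written for $p = 2$ with homogeneous noise, rotates the basis by an orthogonal matrix $U$ and compares the estimates in both bases; the helper `olse`, the data and the rotation angle are assumptions chosen only for illustration.

```python
import math

# Minimal sketch for p = 2 and homogeneous noise; data and rotation angle
# are illustrative assumptions.

def olse(psi, y):
    """Ordinary LSE (Psi Psi^T)^{-1} Psi Y and fitted values for 2 x n psi."""
    n = len(y)
    a = sum(v * v for v in psi[0])
    b = sum(u * v for u, v in zip(psi[0], psi[1]))
    d = sum(v * v for v in psi[1])
    r0 = sum(u * v for u, v in zip(psi[0], y))
    r1 = sum(u * v for u, v in zip(psi[1], y))
    det = a * d - b * b
    theta = [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]
    fitted = [theta[0] * psi[0][i] + theta[1] * psi[1][i] for i in range(n)]
    return theta, fitted


psi = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]

angle = 0.3
U = [[math.cos(angle), -math.sin(angle)],
     [math.sin(angle), math.cos(angle)]]          # orthogonal: U U^T = I

# New basis: rows of U^T Psi.
psi_new = [[sum(U[k][m] * psi[k][i] for k in range(2)) for i in range(len(y))]
           for m in range(2)]

theta, fitted = olse(psi, y)
u, fitted_new = olse(psi_new, y)
# Expected coefficients in the new basis: U^T theta.
u_expected = [sum(U[k][m] * theta[k] for k in range(2)) for m in range(2)]
print(u, u_expected)
print(fitted, fitted_new)
```

Up to rounding, the coefficients transform as $\tilde u = U^\top \tilde\theta$ while the fitted values coincide.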


Exercise 4.2.2. Consider univariate polynomial regression of degree $p - 1$. This
means that $f$ is a polynomial function of degree $p - 1$ observed at the points $X_i$ with
errors $\varepsilon_i$ that are assumed to be i.i.d. normal. The function $f$ can be represented as


$$f(x) = \theta_1^* + \theta_2^* x + \ldots + \theta_p^* x^{p-1}$$


using the basis functions $\psi_m(x) = x^{m-1}$ for $m = 1, \ldots, p$. At the same time,
for any point $x_0$, this function can also be written as


$$f(x) = u_1^* + u_2^* (x - x_0) + \ldots + u_p^* (x - x_0)^{p-1}$$


using the basis functions $\breve\psi_m(x) = (x - x_0)^{m-1}$.


• Write the matrices $\Psi$ and $\Psi \Psi^\top$ and similarly $\breve\Psi$ and $\breve\Psi \breve\Psi^\top$.
• Describe the linear transformation $A$ such that $u = A\theta$ for $p = 1$.
• Describe the transformation $A$ such that $u = A\theta$ for $p > 1$.


Hint: use the formula


$$u_m^* = \frac{1}{(m-1)!} f^{(m-1)}(x_0), \qquad m = 1, \ldots, p,$$

to identify the coefficient $u_m^*$ via $\theta_m^*, \ldots, \theta_p^*$.


4.2.3 Orthogonal and Orthonormal Design 


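Orthogonality of the design matrix $\Psi$ means that the basis vectors $\psi_1, \ldots, \psi_p$ are
orthogonal in the sense


$$\psi_m^\top \psi_{m'} = \sum_{i=1}^{n} \psi_{m,i} \psi_{m',i} = \begin{cases} 0 & \text{if } m \ne m', \\ \lambda_m & \text{if } m = m', \end{cases}$$

for some positive values $\lambda_1, \ldots, \lambda_p$. Equivalently one can write

$$\Psi \Psi^\top = \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p).$$


This feature of the design is very useful and it essentially simplifies the computation
and analysis of the properties of $\tilde\theta$. Indeed, $\Psi \Psi^\top = \Lambda$ implies


$$\tilde\theta = \Lambda^{-1} \Psi Y, \qquad \tilde f = \Psi^\top \tilde\theta = \Psi^\top \Lambda^{-1} \Psi Y$$

with $\Lambda^{-1} = \operatorname{diag}(\lambda_1^{-1}, \ldots, \lambda_p^{-1})$. In particular, the first relation means


$$\tilde\theta_m = \lambda_m^{-1} \sum_{i=1}^{n} Y_i \psi_{m,i},$$


that is, $\tilde\theta_m$ is the scalar product of the data and the basis vector $\psi_m$, divided by $\lambda_m$, for $m = 1, \ldots, p$.
The estimate of the response $f$ reads as


$$\tilde f = \tilde\theta_1 \psi_1 + \ldots + \tilde\theta_p \psi_p.$$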


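Theorem 4.2.1. Consider the model $Y = \Psi^\top \theta^* + \varepsilon$ with homogeneous errors $\varepsilon$:
$\mathbb{E} \varepsilon \varepsilon^\top = \sigma^2 I_n$. If the design $\Psi$ is orthogonal, that is, if $\Psi \Psi^\top = \Lambda$ for a diagonal
matrix $\Lambda$, then the estimated coefficients $\tilde\theta_m$ are uncorrelated: $\operatorname{Var}(\tilde\theta) = \sigma^2 \Lambda^{-1}$.
Moreover, if $\varepsilon \sim N(0, \sigma^2 I_n)$, then $\tilde\theta \sim N(\theta^*, \sigma^2 \Lambda^{-1})$.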


An important message of this result is that the orthogonal design allows for
splitting the original multivariate problem into a collection of independent univariate
problems: each coefficient $\theta_m^*$ is estimated by $\tilde\theta_m$ independently of the remaining
coefficients.

The calculus can be further simplified in the case of an orthogonal design with
$\Psi \Psi^\top = I_p$. Then one speaks about an orthonormal design. This also implies that
every basis function (vector) $\psi_m$ is standardized: $\|\psi_m\|^2 = \sum_{i=1}^{n} \psi_{m,i}^2 = 1$. In
the case of an orthonormal design, the estimate $\tilde\theta$ is particularly simple: $\tilde\theta = \Psi Y$.
Correspondingly, the target of estimation $\theta^*$ satisfies $\theta^* = \Psi f^*$. In other words,
the target is the collection $(\theta_m^*)$ of the Fourier coefficients of the underlying function
(vector) $f^*$ w.r.t. the basis $\Psi$ while the estimate $\tilde\theta$ is the collection of empirical
Fourier coefficients $\tilde\theta_m$:


$$\theta_m^* = \sum_{i=1}^{n} f_i \psi_{m,i}, \qquad \tilde\theta_m = \sum_{i=1}^{n} Y_i \psi_{m,i}.$$


An important feature of the orthonormal design is that it preserves the noise
homogeneity:


$$\operatorname{Var}(\tilde\theta) = \sigma^2 I_p.$$
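For an orthogonal design the estimation indeed reduces to $p$ separate scalar products. A minimal Python sketch, assuming that the rows of `psi` are orthogonal basis vectors (the function name and the toy data are hypothetical), could look as follows.

```python
# Minimal sketch, assuming the rows of psi are orthogonal basis vectors
# (toy data; the function name is hypothetical).

def orthogonal_design_fit(psi, y, tol=1e-12):
    """Return (lam, theta, fitted) for a design with Psi Psi^T = diag(lam)."""
    p, n = len(psi), len(y)
    for m in range(p):
        for k in range(m + 1, p):
            if abs(sum(u * v for u, v in zip(psi[m], psi[k]))) > tol:
                raise ValueError("design is not orthogonal")
    lam = [sum(v * v for v in row) for row in psi]
    # Each coefficient uses only its own basis vector.
    theta = [sum(yi * v for yi, v in zip(y, row)) / l
             for row, l in zip(psi, lam)]
    fitted = [sum(t * row[i] for t, row in zip(theta, psi)) for i in range(n)]
    return lam, theta, fitted


psi = [[1.0, 1.0, 1.0, 1.0],
       [2.0, -2.0, 2.0, -2.0]]
y = [3.0, 1.0, 3.0, 1.0]
lam, theta, fitted = orthogonal_design_fit(psi, y)
print(lam, theta, fitted)
```

No matrix inversion is involved: dividing by $\lambda_m$ replaces the multiplication by $(\Psi \Psi^\top)^{-1}$, and for an orthonormal design even this division disappears.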


4.2.4 Spectral Representation 


Consider a linear model


$$Y = \Psi^\top \theta + \varepsilon \tag{4.5}$$

with homogeneous errors $\varepsilon$: $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$. The rows of the matrix $\Psi$ can be
viewed as basis vectors in $\mathbb{R}^n$ and the product $\Psi^\top \theta$ is a linear combination of these
vectors with the coefficients $(\theta_1, \ldots, \theta_p)$. Effectively linear least squares estimation
does a kind of projection of the data onto the subspace generated by the basis
functions. This projection is of course invariant w.r.t. a basis transformation within
this linear subspace. This fact can be used to reduce the model to the case of an
orthogonal design considered in the previous section. Namely, one can always find
a linear orthogonal transformation $U : \mathbb{R}^p \to \mathbb{R}^p$ ensuring the orthogonality of the
transformed basis. This means that the rows of the matrix $\breve\Psi = U \Psi$ are orthogonal
and the matrix $\breve\Psi \breve\Psi^\top$ is diagonal:


$$\breve\Psi \breve\Psi^\top = U \Psi \Psi^\top U^\top = \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p).$$

The original model reads after this transformation in the form

$$Y = \breve\Psi^\top u + \varepsilon, \qquad \breve\Psi \breve\Psi^\top = \Lambda,$$


where $u = U\theta \in \mathbb{R}^p$. Within this model, the transformed parameter $u$ can be
estimated using the empirical Fourier coefficients $Z_m = \breve\psi_m^\top Y$, where $\breve\psi_m$ is the
$m$th row of $\breve\Psi$, $m = 1, \ldots, p$. The original parameter vector $\theta$ can be recovered via
the equation $\theta = U^\top u$. This set of equations can be written in the form


$$Z = \Lambda u + \Lambda^{1/2} \xi \tag{4.6}$$


where $Z = \breve\Psi Y = U \Psi Y$ is a vector in $\mathbb{R}^p$ and $\xi = \Lambda^{-1/2} \breve\Psi \varepsilon = \Lambda^{-1/2} U \Psi \varepsilon \in \mathbb{R}^p$.
Equation (4.6) is called the spectral representation of the linear model (4.5).
The reason is that the basic transformation $U$ can be built by a singular value
decomposition of $\Psi$. This representation is widely used in the context of linear inverse
problems; see Sect. 4.8.


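Theorem 4.2.2. Consider the model (4.5) with homogeneous errors $\varepsilon$, that is,
$\mathbb{E} \varepsilon \varepsilon^\top = \sigma^2 I_n$. Then there exists an orthogonal transform $U : \mathbb{R}^p \to \mathbb{R}^p$
leading to the spectral representation (4.6) with homogeneous uncorrelated errors
$\xi$: $\mathbb{E} \xi \xi^\top = \sigma^2 I_p$. If $\varepsilon \sim N(0, \sigma^2 I_n)$, then the vector $\xi$ is normal as well:
$\xi \sim N(0, \sigma^2 I_p)$.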


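Exercise 4.2.3. Prove the result of Theorem 4.2.2.
Hint: select any $U$ ensuring $U \Psi \Psi^\top U^\top = \Lambda$. Then

$$\mathbb{E} \xi \xi^\top = \Lambda^{-1/2} U \Psi \, \mathbb{E}(\varepsilon \varepsilon^\top) \, \Psi^\top U^\top \Lambda^{-1/2} = \sigma^2 \Lambda^{-1/2} U \Psi \Psi^\top U^\top \Lambda^{-1/2} = \sigma^2 I_p.$$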


A special case of the spectral representation corresponds to the orthonormal
design with $\Psi \Psi^\top = I_p$. In this situation, the spectral model reads as $Z = u + \xi$,
that is, we simply observe the target $u$ corrupted with a homogeneous noise $\xi$. Such
an equation is often called the sequence space model and it is intensively used in the
literature for the theoretical study; cf. Sect. 4.7 below.


4.3 Properties of the Response Estimate $\tilde f$


This section discusses some properties of the estimate $\tilde f = \Psi^\top \tilde\theta = \Pi Y$ of
the response vector $f^*$. It is worth noting that the first and essential part of the
analysis does not rely on the underlying model distribution, only on our parametric
assumptions that $f^* = \Psi^\top \theta^*$ and $\operatorname{Cov}(\varepsilon) = \Sigma = \sigma^2 I_n$. The real model only
appears when studying the risk of estimation. We will comment on the cases of
misspecified $f$ and $\Sigma$.

When $\Sigma = \sigma^2 I_n$, the operator $\Pi$ in the representation $\tilde f = \Pi Y$ of the estimate
$\tilde f$ reads as


$$\Pi = \Psi^\top (\Psi \Psi^\top)^{-1} \Psi. \tag{4.7}$$


First we make use of the linear structure of the model (4.1) and of the estimate
$\tilde f$ to derive a number of its simple but important properties.


4.3.1 Decomposition into a Deterministic and a Stochastic 
Component 


The model equation $Y = f^* + \varepsilon$ yields


$$\tilde f = \Pi Y = \Pi (f^* + \varepsilon) = \Pi f^* + \Pi \varepsilon. \tag{4.8}$$


The first element of this sum, $\Pi f^*$, is purely deterministic, but it depends on the
unknown response vector $f^*$. Moreover, it will be shown in the next lemma that
$\Pi f^* = f^*$ if the parametric assumption holds and the vector $f^*$ indeed can be
represented as $\Psi^\top \theta^*$. The second element is stochastic as a linear transformation of
the stochastic vector $\varepsilon$ but is independent of the model response $f^*$. The properties
of the estimate $\tilde f$ heavily rely on the properties of the linear operator $\Pi$ from (4.7)
which we collect in the next section.


4.3.2 Properties of the Operator $\Pi$


Let $\psi_1, \ldots, \psi_p$ be the columns of the matrix $\Psi^\top$. These are the vectors in $\mathbb{R}^n$ also
called the basis vectors.

Lemma 4.3.1. Let the matrix $\Psi \Psi^\top$ be non-degenerate. Then the operator $\Pi$ fulfills
the following conditions:


(i) $\Pi$ is symmetric (self-adjoint), that is, $\Pi^\top = \Pi$.
(ii) $\Pi$ is a projector in $\mathbb{R}^n$, i.e. $\Pi^\top \Pi = \Pi^2 = \Pi$ and $\Pi (1_n - \Pi) = 0$, where $1_n$
means the unity operator in $\mathbb{R}^n$.
(iii) For an arbitrary vector $v$ from $\mathbb{R}^n$, it holds $\|v\|^2 = \|\Pi v\|^2 + \|v - \Pi v\|^2$.
(iv) The trace of $\Pi$ is equal to the dimension of its image, $\operatorname{tr} \Pi = p$.
(v) $\Pi$ projects the linear space $\mathbb{R}^n$ on the linear subspace $L_p = \langle \psi_1, \ldots, \psi_p \rangle$
which is spanned by the basis vectors $\psi_1, \ldots, \psi_p$, that is,


$$\|f^* - \Pi f^*\| = \inf_{g \in L_p} \|f^* - g\|.$$

(vi) The matrix $\Pi$ can be represented in the form

$$\Pi = U^\top \Lambda_p U$$


where $U$ is an orthonormal matrix and $\Lambda_p$ is a diagonal matrix with the first
$p$ diagonal elements equal to 1 and the others equal to zero:


$$\Lambda_p = \operatorname{diag}\{\underbrace{1, \ldots, 1}_{p}, \underbrace{0, \ldots, 0}_{n-p}\}.$$


Proof. It holds

$$\Pi^\top = \bigl\{ \Psi^\top (\Psi \Psi^\top)^{-1} \Psi \bigr\}^\top = \Psi^\top (\Psi \Psi^\top)^{-1} \Psi = \Pi$$

and

$$\Pi^2 = \Psi^\top (\Psi \Psi^\top)^{-1} \Psi \Psi^\top (\Psi \Psi^\top)^{-1} \Psi = \Psi^\top (\Psi \Psi^\top)^{-1} \Psi = \Pi,$$


which proves the first two statements of the lemma. The third one follows directly
from the first two. Next,


$$\operatorname{tr} \Pi = \operatorname{tr} \Psi^\top (\Psi \Psi^\top)^{-1} \Psi = \operatorname{tr} \Psi \Psi^\top (\Psi \Psi^\top)^{-1} = \operatorname{tr} I_p = p.$$

The second property means that $\Pi$ is a projector in $\mathbb{R}^n$ and the fourth one means
that the dimension of its image space is equal to $p$. The basis vectors $\psi_1, \ldots, \psi_p$
are the rows of the matrix $\Psi$. It is clear that


$$\Pi \Psi^\top = \Psi^\top (\Psi \Psi^\top)^{-1} \Psi \Psi^\top = \Psi^\top.$$


Therefore, the vectors $\psi_m$ are invariants of the operator $\Pi$ and in particular, all these
vectors belong to the image space of this operator. If now $g$ is a vector in $L_p$, then
it can be represented as $g = c_1 \psi_1 + \ldots + c_p \psi_p$ and therefore, $\Pi g = g$ and
$\Pi L_p = L_p$. Finally, the non-singularity of the matrix $\Psi \Psi^\top$ means that the vectors
$\psi_1, \ldots, \psi_p$ forming the rows of $\Psi$ are linearly independent. Therefore, the space
$L_p$ spanned by the vectors $\psi_1, \ldots, \psi_p$ is of dimension $p$, and hence it coincides
with the image space of the operator $\Pi$.

The last property is the usual diagonal decomposition of a projector.
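Properties (i) to (iv) of Lemma 4.3.1 can be verified numerically on a small example. The Python sketch below builds $\Pi$ for $p = 2$ and $n = 4$; the helper `projector`, the design and the test vector are assumptions made only for illustration.

```python
# Minimal sketch for p = 2; the design below is an illustrative assumption.

def projector(psi):
    """Pi = Psi^T (Psi Psi^T)^{-1} Psi for a 2 x n matrix psi (list of rows)."""
    n = len(psi[0])
    a = sum(v * v for v in psi[0])
    b = sum(u * v for u, v in zip(psi[0], psi[1]))
    d = sum(v * v for v in psi[1])
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]   # (Psi Psi^T)^{-1}
    return [[sum(psi[m][i] * inv[m][k] * psi[k][j]
                 for m in range(2) for k in range(2))
             for j in range(n)] for i in range(n)]


psi = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]
n = 4
pi = projector(psi)

# (ii): Pi^2 should coincide with Pi; (iv): the trace should equal p = 2.
pi_squared = [[sum(pi[i][k] * pi[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
trace = sum(pi[i][i] for i in range(n))

# (iii): Pythagoras identity for an arbitrary vector v.
v = [1.0, -2.0, 0.5, 4.0]
pv = [sum(pi[i][j] * v[j] for j in range(n)) for i in range(n)]
norm_v = sum(t * t for t in v)
norm_split = sum(t * t for t in pv) + sum((s - t) ** 2 for s, t in zip(v, pv))
print(trace, norm_v, norm_split)
```

The same checks apply to any design with a non-degenerate $\Psi \Psi^\top$; only the explicit $2 \times 2$ inverse is specific to $p = 2$.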


Exercise 4.3.1. Consider the case of an orthogonal design with $\Psi \Psi^\top = I_p$. Specify
the projector $\Pi$ of Lemma 4.3.1 for this situation, particularly its decomposition
from (vi).


4.3.3 Quadratic Loss and Risk of the Response Estimation


In this section we study the quadratic risk of estimating the response $f^*$. The reason
for studying this risk will be made clear
when we discuss the properties of the fitted likelihood in the next section.

The loss $\wp(\tilde f, f^*)$ of the estimate $\tilde f$ can be naturally defined as the squared
norm of the difference $\tilde f - f^*$:


$$\wp(\tilde f, f^*) = \|\tilde f - f^*\|^2 = \sum_{i=1}^{n} |\tilde f_i - f_i^*|^2.$$

Correspondingly, the quadratic risk of the estimate $\tilde f$ is the mean of this loss

$$\mathcal{R}(\tilde f) = \mathbb{E} \wp(\tilde f, f^*) = \mathbb{E} \bigl[ (\tilde f - f^*)^\top (\tilde f - f^*) \bigr]. \tag{4.9}$$


The next result describes the loss and risk decomposition for two cases: when the
parametric assumption $f^* = \Psi^\top \theta^*$ is correct and in the general case.


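Theorem 4.3.1. Suppose that the errors $\varepsilon_i$ from (4.1) are independent with $\mathbb{E} \varepsilon_i = 0$
and $\mathbb{E} \varepsilon_i^2 = \sigma^2$, i.e. $\Sigma = \sigma^2 I_n$. Then the loss $\wp(\tilde f, f^*) = \|\Pi Y - f^*\|^2$ and the
risk $\mathcal{R}(\tilde f)$ of the LSE $\tilde f$ fulfill

$$\wp(\tilde f, f^*) = \|f^* - \Pi f^*\|^2 + \|\Pi \varepsilon\|^2,$$
$$\mathcal{R}(\tilde f) = \|f^* - \Pi f^*\|^2 + p \sigma^2.$$


Moreover, if $f^* = \Psi^\top \theta^*$, then


$$\wp(\tilde f, f^*) = \|\Pi \varepsilon\|^2,$$
$$\mathcal{R}(\tilde f) = p \sigma^2.$$

Proof. We apply (4.9) and the decomposition (4.8) of the estimate $\tilde f$. It follows


$$\wp(\tilde f, f^*) = \|\tilde f - f^*\|^2 = \|f^* - \Pi f^* - \Pi \varepsilon\|^2 = \|f^* - \Pi f^*\|^2 - 2 (f^* - \Pi f^*)^\top \Pi \varepsilon + \|\Pi \varepsilon\|^2.$$


The cross term vanishes because $(1_n - \Pi) \Pi = 0$ by Lemma 4.3.1(ii), which implies the decomposition for the loss of $\tilde f$. Next we
compute the mean of $\|\Pi \varepsilon\|^2$ applying again Lemma 4.3.1. Indeed


$$\mathbb{E} \|\Pi \varepsilon\|^2 = \mathbb{E} (\Pi \varepsilon)^\top \Pi \varepsilon = \mathbb{E} \operatorname{tr}\{\Pi \varepsilon (\Pi \varepsilon)^\top\} = \mathbb{E} \operatorname{tr}(\Pi \varepsilon \varepsilon^\top \Pi^\top) = \operatorname{tr}\{\Pi \, \mathbb{E}(\varepsilon \varepsilon^\top) \, \Pi\} = \sigma^2 \operatorname{tr}(\Pi^2) = p \sigma^2.$$


Now consider the case when $f^* = \Psi^\top \theta^*$. By Lemma 4.3.1 $f^* = \Pi f^*$ and
the last two statements of the theorem clearly follow.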


4.3.4 Misspecified “Colored Noise” 


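Here we briefly comment on the case when $\varepsilon$ is not a white noise. So, our
assumption about the errors $\varepsilon_i$ is that they are uncorrelated and homogeneous, that
is, $\Sigma = \sigma^2 I_n$, while the true covariance matrix is given by $\Sigma_0$. Many properties of
the estimate $\tilde f = \Pi Y$ which are simply based on the linearity of the model (4.1)
and of the estimate $\tilde f$ itself continue to apply. In particular, the loss $\wp(\tilde f, f^*) = \|\tilde f - f^*\|^2$ can again be decomposed as

$$\|\tilde f - f^*\|^2 = \|f^* - \Pi f^*\|^2 + \|\Pi \varepsilon\|^2.$$


Theorem 4.3.2. Suppose that $\mathbb{E} \varepsilon = 0$ and $\operatorname{Var}(\varepsilon) = \Sigma_0$. Then the loss $\wp(\tilde f, f^*)$
and the risk $\mathcal{R}(\tilde f)$ of the LSE $\tilde f$ fulfill


$$\wp(\tilde f, f^*) = \|f^* - \Pi f^*\|^2 + \|\Pi \varepsilon\|^2,$$
$$\mathcal{R}(\tilde f) = \|f^* - \Pi f^*\|^2 + \operatorname{tr}(\Pi \Sigma_0 \Pi).$$

Moreover, if $f^* = \Psi^\top \theta^*$, then

$$\wp(\tilde f, f^*) = \|\Pi \varepsilon\|^2,$$
$$\mathcal{R}(\tilde f) = \operatorname{tr}(\Pi \Sigma_0 \Pi).$$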


Proof. The decomposition of the loss from Theorem 4.3.1 only relies on the
geometric properties of the projector $\Pi$ and does not use the covariance structure of
the noise. Hence, it only remains to check the expectation of $\|\Pi \varepsilon\|^2$. Observe that

$$\mathbb{E} \|\Pi \varepsilon\|^2 = \mathbb{E} \operatorname{tr}[\Pi \varepsilon (\Pi \varepsilon)^\top] = \operatorname{tr}[\Pi \, \mathbb{E}(\varepsilon \varepsilon^\top) \, \Pi] = \operatorname{tr}(\Pi \Sigma_0 \Pi)$$


as required.
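The two risk terms of Theorems 4.3.1 and 4.3.2 can be computed exactly for a small design. The following Python sketch (for $p = 2$; the helpers `projector` and `risk_terms` and all numbers are illustrative assumptions) returns the squared bias $\|f^* - \Pi f^*\|^2$ and the variance term $\operatorname{tr}(\Pi \Sigma_0 \Pi)$.

```python
# Minimal sketch for p = 2; designs, responses and covariances are toy
# assumptions used only to illustrate the risk decomposition.

def projector(psi):
    """Pi = Psi^T (Psi Psi^T)^{-1} Psi for a 2 x n matrix psi (list of rows)."""
    n = len(psi[0])
    a = sum(v * v for v in psi[0])
    b = sum(u * v for u, v in zip(psi[0], psi[1]))
    d = sum(v * v for v in psi[1])
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    return [[sum(psi[m][i] * inv[m][k] * psi[k][j]
                 for m in range(2) for k in range(2))
             for j in range(n)] for i in range(n)]


def risk_terms(psi, f_star, sigma0):
    """Return (bias2, variance) = (||f* - Pi f*||^2, tr(Pi Sigma0 Pi))."""
    n = len(f_star)
    pi = projector(psi)
    proj = [sum(pi[i][j] * f_star[j] for j in range(n)) for i in range(n)]
    bias2 = sum((f - g) ** 2 for f, g in zip(f_star, proj))
    variance = sum(pi[i][j] * sigma0[j][k] * pi[k][i]
                   for i in range(n) for j in range(n) for k in range(n))
    return bias2, variance


design = [0.0, 1.0, 2.0, 3.0]
psi = [[1.0] * 4, design]                          # linear fit: 1 and x
white = [[0.5 if i == j else 0.0 for j in range(4)] for i in range(4)]
colored = [[[0.5, 2.0, 0.5, 2.0][i] if i == j else 0.0 for j in range(4)]
           for i in range(4)]

bias_lin, var_white = risk_terms(psi, [1.0 + 2.0 * x for x in design], white)
bias_quad, var_colored = risk_terms(psi, [x * x for x in design], colored)
print(bias_lin, var_white, bias_quad, var_colored)
```

With white noise and a correct linear model the bias vanishes and the variance term equals $p\sigma^2$; a quadratic response adds a positive bias term, and colored noise changes only the variance term.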


4.4 Properties of the MLE $\tilde\theta$


In this section we focus on the properties of the quasi MLE $\tilde\theta$ built for the idealized
linear Gaussian model $Y = \Psi^\top \theta^* + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I_n)$. As in the previous
section, we do not assume the parametric structure of the underlying model and
consider a more general model $Y = f^* + \varepsilon$ with an unknown vector $f^*$ and errors
$\varepsilon$ with zero mean and covariance matrix $\Sigma_0$. Due to (4.3), it holds $\tilde\theta = \mathcal{S} Y$ with
$\mathcal{S} = (\Psi \Psi^\top)^{-1} \Psi$. An important feature of this estimate is its linear dependence on
the data. The linear model equation $Y = f^* + \varepsilon$ and the linear structure of the estimate
$\tilde\theta = \mathcal{S} Y$ allow us to decompose the vector $\tilde\theta$ into a deterministic and a stochastic
term:


$$\tilde\theta = \mathcal{S} Y = \mathcal{S} (f^* + \varepsilon) = \mathcal{S} f^* + \mathcal{S} \varepsilon. \tag{4.10}$$


The first term $\mathcal{S} f^*$ is deterministic but depends on the unknown vector $f^*$ while
the second term $\mathcal{S} \varepsilon$ is stochastic but it does not involve the model response $f^*$.
Below we study the properties of each component separately.


4.4.1 Properties of the Stochastic Component 


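The next result describes the distributional properties of the stochastic component
$\zeta = \mathcal{S} \varepsilon$ for $\mathcal{S} = (\Psi \Psi^\top)^{-1} \Psi$ and thus, of the estimate $\tilde\theta$.


Theorem 4.4.1. Assume $Y = f^* + \varepsilon$ with $\mathbb{E} \varepsilon = 0$ and $\operatorname{Var}(\varepsilon) = \Sigma_0$. The
stochastic component $\zeta = \mathcal{S} \varepsilon$ in (4.10) fulfills


$$\mathbb{E} \zeta = 0, \qquad W^2 := \operatorname{Var}(\zeta) = \mathcal{S} \Sigma_0 \mathcal{S}^\top, \qquad \mathbb{E} \|\zeta\|^2 = \operatorname{tr} W^2 = \operatorname{tr}(\mathcal{S} \Sigma_0 \mathcal{S}^\top).$$

Moreover, if $\Sigma = \Sigma_0 = \sigma^2 I_n$, then

$$W^2 = \sigma^2 (\Psi \Psi^\top)^{-1}, \qquad \mathbb{E} \|\zeta\|^2 = \operatorname{tr}(W^2) = \sigma^2 \operatorname{tr}\bigl[ (\Psi \Psi^\top)^{-1} \bigr]. \tag{4.11}$$

Similarly for the estimate $\tilde\theta$ it holds


$$\mathbb{E} \tilde\theta = \mathcal{S} f^*, \qquad \operatorname{Var}(\tilde\theta) = W^2.$$

If the errors $\varepsilon$ are Gaussian, then both $\zeta$ and $\tilde\theta$ are Gaussian as well:

$$\zeta \sim N(0, W^2), \qquad \tilde\theta \sim N(\mathcal{S} f^*, W^2).$$

Proof. For the variance $W^2$ of $\zeta$ it holds

$$\operatorname{Var}(\zeta) = \mathbb{E} \zeta \zeta^\top = \mathcal{S} \, \mathbb{E}(\varepsilon \varepsilon^\top) \, \mathcal{S}^\top = \mathcal{S} \Sigma_0 \mathcal{S}^\top.$$


Next we use that $\mathbb{E} \|\zeta\|^2 = \mathbb{E} \zeta^\top \zeta = \mathbb{E} \operatorname{tr}(\zeta \zeta^\top) = \operatorname{tr} W^2$. If $\Sigma = \Sigma_0 = \sigma^2 I_n$,
then (4.11) follows by simple algebra.

If $\varepsilon$ is a Gaussian vector, then $\zeta$ as its linear transformation is Gaussian as well.
The properties of $\tilde\theta$ follow directly from the decomposition (4.10).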


With $\Sigma_0 \ne \sigma^2 I_n$, the variance $W^2$ can be represented as

$$W^2 = (\Psi \Psi^\top)^{-1} \Psi \Sigma_0 \Psi^\top (\Psi \Psi^\top)^{-1}.$$


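Exercise 4.4.1. Let $\zeta$ be the stochastic component of $\tilde\theta$ built for the misspecified
linear model $Y = \Psi^\top \theta^* + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \Sigma$. Let also the true noise variance be
$\Sigma_0$. Then $\operatorname{Var}(\tilde\theta) = W^2$ with


$$W^2 = (\Psi \Sigma^{-1} \Psi^\top)^{-1} \Psi \Sigma^{-1} \Sigma_0 \Sigma^{-1} \Psi^\top (\Psi \Sigma^{-1} \Psi^\top)^{-1}. \tag{4.12}$$


The main finding in the presented study is that the stochastic part $\zeta = \mathcal{S} \varepsilon$ of
the estimate $\tilde\theta$ is completely independent of the structure of the vector $f^*$. In other
words, the behavior of the stochastic component $\zeta$ does not change even if the linear
parametric assumption is misspecified.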


4.4.2 Properties of the Deterministic Component 


Now we study the deterministic term starting with the parametric situation $f^* = \Psi^\top \theta^*$.
Here we only specify the results for the case 1 with $\Sigma = \sigma^2 I_n$.


Theorem 4.4.2. Let $f^* = \Psi^\top \theta^*$. Then $\tilde\theta = \mathcal{S} Y$ with $\mathcal{S} = (\Psi \Psi^\top)^{-1} \Psi$ is
unbiased, that is, $\mathbb{E} \tilde\theta = \mathcal{S} f^* = \theta^*$.


Proof. For the proof, just observe that $\mathcal{S} f^* = (\Psi \Psi^\top)^{-1} \Psi \Psi^\top \theta^* = \theta^*$.


Now we briefly discuss what happens when the linear parametric assumption is
not fulfilled, that is, $f^*$ cannot be represented as $\Psi^\top \theta^*$. In this case it is not yet clear
what $\tilde\theta$ really estimates. The answer is given in the context of the general theory of
minimum contrast estimation. Namely, define $\theta^*$ as the point which maximizes the
expectation of the (quasi) log-likelihood $L(\theta)$:


$$\theta^* = \operatorname*{argmax}_{\theta} \mathbb{E} L(\theta). \tag{4.13}$$
6 


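Theorem 4.4.3. The solution $\theta^*$ of the optimization problem (4.13) is given by

$$\theta^* = \mathcal{S} f^* = (\Psi \Psi^\top)^{-1} \Psi f^*.$$

Moreover,

$$\Psi^\top \theta^* = \Pi f^* = \Psi^\top (\Psi \Psi^\top)^{-1} \Psi f^*.$$


In particular, if $f^* = \Psi^\top \theta^*$, then $\theta^*$ follows (4.13).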


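Proof. The use of the model equation $Y = f^* + \varepsilon$ and of the properties of the
stochastic component $\zeta$ yield by simple algebra


$$\operatorname*{argmax}_{\theta} \mathbb{E} L(\theta) = \operatorname*{argmin}_{\theta} \mathbb{E} (f^* - \Psi^\top \theta + \varepsilon)^\top (f^* - \Psi^\top \theta + \varepsilon)$$

$$= \operatorname*{argmin}_{\theta} \bigl\{ (f^* - \Psi^\top \theta)^\top (f^* - \Psi^\top \theta) + \mathbb{E}(\varepsilon^\top \varepsilon) \bigr\}$$

$$= \operatorname*{argmin}_{\theta} \bigl\{ (f^* - \Psi^\top \theta)^\top (f^* - \Psi^\top \theta) \bigr\}.$$


Differentiating w.r.t. $\theta$ leads to the equation

$$\Psi (f^* - \Psi^\top \theta) = 0$$


and the solution $\theta^* = (\Psi \Psi^\top)^{-1} \Psi f^*$ which is exactly the expected value of $\tilde\theta$ by
Theorem 4.4.1.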
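The target $\theta^* = \mathcal{S} f^*$ under a misspecified linear structure can be illustrated by a small Python sketch. Here a quadratic response is approximated by a linear basis (so $p = 2$); the helper `best_linear_target` and the numbers are assumptions for this illustration. The residual $f^* - \Psi^\top \theta^*$ should be orthogonal to every basis vector, which is exactly the equation $\Psi (f^* - \Psi^\top \theta) = 0$ from the proof.

```python
# Minimal sketch for p = 2: target theta* = (Psi Psi^T)^{-1} Psi f* when f*
# is not in the span of the basis vectors. Toy data, hypothetical names.

def best_linear_target(psi, f_star):
    """Return theta* = (Psi Psi^T)^{-1} Psi f* for a 2 x n matrix psi."""
    a = sum(v * v for v in psi[0])
    b = sum(u * v for u, v in zip(psi[0], psi[1]))
    d = sum(v * v for v in psi[1])
    r0 = sum(u * v for u, v in zip(psi[0], f_star))
    r1 = sum(u * v for u, v in zip(psi[1], f_star))
    det = a * d - b * b
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]


design = [-1.0, 0.0, 1.0]
psi = [[1.0, 1.0, 1.0], design]          # basis functions 1 and x
f_star = [x * x for x in design]         # true response: quadratic

theta_star = best_linear_target(psi, f_star)
approx = [theta_star[0] * psi[0][i] + theta_star[1] * psi[1][i]
          for i in range(3)]
residual = [f - g for f, g in zip(f_star, approx)]
# Normal equation: Psi (f* - Psi^T theta*) = 0.
orthogonality = [sum(psi[m][i] * residual[i] for i in range(3))
                 for m in range(2)]
print(theta_star, orthogonality)
```

In this example the best linear approximation is the constant $2/3$: the symmetric design makes the slope coefficient vanish.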


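Exercise 4.4.2. State the result of Theorems 4.4.2 and 4.4.3 for the MLE $\tilde\theta$ built in
the model $Y = \Psi^\top \theta^* + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \Sigma$.
Hint: check that the statements continue to apply with $\mathcal{S} = (\Psi \Sigma^{-1} \Psi^\top)^{-1} \Psi \Sigma^{-1}$.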


The last results and the decomposition (4.10) explain the behavior of the estimate
$\tilde\theta$ in a very general situation. The considered model is $Y = f^* + \varepsilon$. We assume
a linear parametric structure and independent homogeneous noise. The estimation
procedure means in fact a kind of projection of the data $Y$ on a $p$-dimensional linear
subspace in $\mathbb{R}^n$ spanned by the given basis vectors $\psi_1, \ldots, \psi_p$. This projection, as
a linear operator, can be decomposed into a projection of the deterministic vector
$f^*$ and a projection of the random noise $\varepsilon$. If the linear parametric assumption
$f^* \in \langle \psi_1, \ldots, \psi_p \rangle$ is correct, that is, $f^* = \theta_1^* \psi_1 + \ldots + \theta_p^* \psi_p$, then this projection
keeps $f^*$ unchanged and only the random noise is reduced via this projection. If $f^*$
cannot be exactly expanded using the basis $\psi_1, \ldots, \psi_p$, then the procedure recovers
the projection of $f^*$ onto this subspace. The latter projection can be written as
$\Psi^\top \theta^*$ and the vector $\theta^*$ can be viewed as the target of estimation.


4.4.3 Risk of Estimation: R-Efficiency


This section briefly discusses how the obtained properties of the estimate $\tilde\theta$ can
be used to evaluate the risk of estimation. A particularly important question is the
optimality of the MLE $\tilde\theta$. The main result of the section claims that $\tilde\theta$ is R-efficient
if the model is correctly specified and is not if there is a misspecification.

We start with the case of a correct parametric specification $Y = \Psi^\top \theta^* + \varepsilon$, that
is, the linear parametric assumption $f^* = \Psi^\top \theta^*$ is exactly fulfilled and the noise
$\varepsilon$ is homogeneous: $\varepsilon \sim N(0, \sigma^2 I_n)$. Later we extend the result to the case when the
LPA $f^* = \Psi^\top \theta^*$ is not fulfilled and to the case when the noise is not homogeneous
but still correctly specified. Finally we discuss the case when the noise structure is
misspecified.

Under LPA $Y = \Psi^\top \theta^* + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I_n)$, the estimate $\tilde\theta$ is also normal
with mean $\theta^*$ and the variance $W^2 = \sigma^2 \mathcal{S} \mathcal{S}^\top = \sigma^2 (\Psi \Psi^\top)^{-1}$. Define a $p \times p$
symmetric matrix $D$ by the equation


$$D^2 = \frac{1}{\sigma^2} \sum_{i=1}^{n} \Psi_i \Psi_i^\top = \frac{1}{\sigma^2} \Psi \Psi^\top.$$

Clearly $W^2 = D^{-2}$.

Now we show that $\tilde\theta$ is R-efficient. Actually this fact can be derived from
the Cramér–Rao Theorem because the Gaussian model is a special case of an
exponential family. However, we check this statement directly by computing the
Cramér–Rao efficiency bound. Recall that the Fisher information matrix $F(\theta)$ for
the log-likelihood $L(\theta)$ is defined as the variance of $\nabla L(\theta)$ under $P_\theta$.


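Theorem 4.4.4 (Gauss–Markov). Let $Y = \Psi^\top \theta^* + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I_n)$. Then
$\tilde\theta$ is an R-efficient estimate of $\theta^*$: $\mathbb{E} \tilde\theta = \theta^*$,


$$\mathbb{E} \bigl[ (\tilde\theta - \theta^*)(\tilde\theta - \theta^*)^\top \bigr] = \operatorname{Var}(\tilde\theta) = D^{-2},$$

and for any unbiased linear estimate $\hat\theta$ satisfying $\mathbb{E}_\theta \hat\theta \equiv \theta$, it holds


$$\operatorname{Var}(\hat\theta) \ge \operatorname{Var}(\tilde\theta) = D^{-2}.$$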


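Proof. Theorems 4.4.1 and 4.4.2 imply that $\tilde\theta \sim N(\theta^*, W^2)$ with $W^2 = \sigma^2 (\Psi \Psi^\top)^{-1} = D^{-2}$. Next we show that for any $\theta$


$$\operatorname{Var}[\nabla L(\theta)] = D^2,$$


that is, the Fisher information does not depend on the model function $f^*$. The
log-likelihood $L(\theta)$ for the model $Y \sim N(\Psi^\top \theta^*, \sigma^2 I_n)$ reads as


$$L(\theta) = -\frac{1}{2\sigma^2} (Y - \Psi^\top \theta)^\top (Y - \Psi^\top \theta) - \frac{n}{2} \log(2\pi\sigma^2).$$

This yields for its gradient $\nabla L(\theta)$:

$$\nabla L(\theta) = \sigma^{-2} \Psi (Y - \Psi^\top \theta)$$

and in view of $\operatorname{Var}(Y) = \Sigma = \sigma^2 I_n$, it holds

$$\operatorname{Var}[\nabla L(\theta)] = \sigma^{-4} \Psi \operatorname{Var}(Y) \Psi^\top = \sigma^{-2} \Psi \Psi^\top$$


as required.

The R-efficiency of $\tilde\theta$ follows from the Cramér–Rao efficiency bound because
$\{\operatorname{Var}(\tilde\theta)\}^{-1} = \operatorname{Var}\{\nabla L(\theta)\}$. However, we present an independent proof of this fact.
Actually we prove a sharper result that the variance of a linear unbiased estimate $\hat\theta$
coincides with the variance of $\tilde\theta$ only if $\hat\theta$ coincides almost surely with $\tilde\theta$, otherwise
it is larger. The idea of the proof is quite simple. Consider the difference $\hat\theta - \tilde\theta$ and
show that the condition $\mathbb{E} \hat\theta = \mathbb{E} \tilde\theta = \theta^*$ implies orthogonality $\mathbb{E}\{\tilde\theta (\hat\theta - \tilde\theta)^\top\} = 0$.
This, in turn, implies $\operatorname{Var}(\hat\theta) = \operatorname{Var}(\tilde\theta) + \operatorname{Var}(\hat\theta - \tilde\theta) \ge \operatorname{Var}(\tilde\theta)$. So, it remains
to check the orthogonality of $\tilde\theta$ and $\hat\theta - \tilde\theta$. Let $\hat\theta = A Y$ for a $p \times n$ matrix
$A$ and $\mathbb{E}_\theta \hat\theta \equiv \theta$ for all $\theta$. These two equalities and $\mathbb{E} Y = \Psi^\top \theta^*$ imply that
$A \Psi^\top \theta^* \equiv \theta^*$, i.e. $A \Psi^\top$ is the identity $p \times p$ matrix. The same is true for $\tilde\theta = \mathcal{S} Y$
yielding $\mathcal{S} \Psi^\top = I_p$. Next, in view of $\mathbb{E} \hat\theta = \mathbb{E} \tilde\theta = \theta^*$


$$\mathbb{E}\{(\hat\theta - \tilde\theta) \tilde\theta^\top\} = \mathbb{E} (A - \mathcal{S}) \varepsilon \varepsilon^\top \mathcal{S}^\top = \sigma^2 (A - \mathcal{S}) \Psi^\top (\Psi \Psi^\top)^{-1} = 0$$


and the assertion follows.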
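The variance comparison in the proof can be made concrete in the simplest case $p = 1$, where $\Psi$ is a single row vector. The Python sketch below compares the oLSE with another linear unbiased estimate built from arbitrary weights; the weights `w`, the design and all names are assumptions chosen for illustration.

```python
# Minimal sketch for p = 1 (a single basis vector) with Var(Y) = sigma2 * I.
# The competing estimator uses arbitrary weights w; all numbers are toy
# assumptions.

def linear_estimator_variance(coefs, sigma2):
    """Variance of sum_i coefs[i] * Y_i for homogeneous uncorrelated noise."""
    return sigma2 * sum(c * c for c in coefs)


psi = [1.0, 2.0, 3.0]
sigma2 = 1.0

# oLSE: S = (Psi Psi^T)^{-1} Psi.
norm2 = sum(v * v for v in psi)
s = [v / norm2 for v in psi]

# Another linear unbiased estimate: A = (Psi W Psi^T)^{-1} Psi W.
w = [1.0, 1.0, 5.0]
denom = sum(wi * v * v for wi, v in zip(w, psi))
a = [wi * v / denom for wi, v in zip(w, psi)]

# Unbiasedness: A Psi^T = S Psi^T = 1.
a_psi = sum(ai * v for ai, v in zip(a, psi))
s_psi = sum(si * v for si, v in zip(s, psi))

# Orthogonality (A - S) S^T = 0 and the resulting variance decomposition.
cross = sum((ai - si) * si for ai, si in zip(a, s))
var_ols = linear_estimator_variance(s, sigma2)
var_alt = linear_estimator_variance(a, sigma2)
var_diff = linear_estimator_variance([ai - si for ai, si in zip(a, s)], sigma2)
print(var_ols, var_alt, var_diff)
```

Both estimates are unbiased, the cross term vanishes, and the variance of the competitor exceeds that of the oLSE by exactly the variance of their difference.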


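Exercise 4.4.3. Check the details of the proof of the theorem. Show that the
statement $\operatorname{Var}(\hat\theta) \ge \operatorname{Var}(\tilde\theta)$ only uses that $\hat\theta$ is unbiased and that $\mathbb{E} Y = \Psi^\top \theta^*$
and $\operatorname{Var}(Y) = \sigma^2 I_n$.


Exercise 4.4.4. Compute $\nabla^2 L(\theta)$. Check that it is non-random, does not depend
on $\theta$, and fulfills for every $\theta$ the identity


$$\nabla^2 L(\theta) \equiv -\operatorname{Var}[\nabla L(\theta)] = -D^2.$$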


4.4.3.1 A Colored Noise 


The majority of the presented results continue to apply in the case of heterogeneous
and even dependent noise with $\operatorname{Var}(\varepsilon) = \Sigma_0$. The key facts behind this extension
are the decomposition (4.10) and the properties of the stochastic component $\zeta$ from
Sect. 4.4.1: $\zeta \sim N(0, W^2)$. In the case of a colored noise, the definition of $W$ and
$D$ is changed for

$$D^2 := W^{-2} = \Psi \Sigma_0^{-1} \Psi^\top.$$


Exercise 4.4.5. State and prove the analog of Theorem 4.4.4 for the colored noise
$\varepsilon \sim N(0, \Sigma_0)$.


4.4.3.2 A Misspecified LPA 


An interesting feature of our results so far is that they equally apply for the correct
linear specification $f^* = \Psi^\top\theta^*$ and for the case when the identity $f^* = \Psi^\top\theta$ is
not precisely fulfilled whatever $\theta$ is taken. In this situation the target of analysis is
the vector $\theta^*$ describing the best linear approximation of $f^*$ by $\Psi^\top\theta$. We already
know from the results of Sects. 4.4.1 and 4.4.2 that the estimate $\tilde\theta$ is also normal
with mean $\theta^* = Sf^* = (\Psi\Psi^\top)^{-1}\Psi f^*$ and the variance $W^2 = \sigma^2 SS^\top = \sigma^2(\Psi\Psi^\top)^{-1}$.

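Theorem 4.4.5. Assume $Y = f^* + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I_n)$. Let $\theta^* = Sf^*$. Then
$\tilde\theta$ is an R-efficient estimate of $\theta^*$: $\mathbb{E}\tilde\theta = \theta^*$,

\[ \mathbb{E}\bigl[(\tilde\theta - \theta^*)(\tilde\theta - \theta^*)^\top\bigr] = \operatorname{Var}(\tilde\theta) = D^{-2}, \]
and for any unbiased linear estimate $\hat\theta$ satisfying $\mathbb{E}\hat\theta \equiv \theta^*$, it holds
\[ \operatorname{Var}(\hat\theta) \ge \operatorname{Var}(\tilde\theta) = D^{-2}. \]
Proof. The proofs only utilize that $\tilde\theta \sim N(\theta^*, W^2)$ with $W^2 = D^{-2}$. The only
small remark concerns the equality $\operatorname{Var}[\nabla L(\theta)] = D^2$ from Theorem 4.4.4.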


Exercise 4.4.6. Check the identity $\operatorname{Var}[\nabla L(\theta)] = D^2$ from Theorem 4.4.4 for
$\varepsilon \sim N(0, \Sigma_0)$.


4.4.4 The Case of a Misspecified Noise 


Here we again consider the linear parametric assumption $Y = \Psi^\top\theta^* + \varepsilon$. However,
contrary to the previous section, we admit that the noise $\varepsilon$ is not homogeneous
normal: $\varepsilon \sim N(0, \Sigma_0)$ while our estimation procedure is the quasi MLE based on
the assumption of noise homogeneity $\varepsilon \sim N(0, \sigma^2 I_n)$. We already know that the
estimate $\tilde\theta$ is unbiased with mean $\theta^*$ and variance $W^2 = S\Sigma_0 S^\top$, where
$S = (\Psi\Psi^\top)^{-1}\Psi$. This gives

\[ W^2 = (\Psi\Psi^\top)^{-1}\Psi\Sigma_0\Psi^\top(\Psi\Psi^\top)^{-1}. \]

The question is whether the estimate $\tilde\theta$ based on the misspecified distributional
assumption is efficient. The Cramér-Rao result delivers the lower bound for the
quadratic risk in form of $\operatorname{Var}(\tilde\theta) \ge [\operatorname{Var}(\nabla L(\theta))]^{-1}$. We already know that the
use of the correctly specified covariance matrix of the errors leads to an R-efficient
estimate $\tilde\theta$. The next result shows that the use of a misspecified matrix $\Sigma$ results in
an estimate which is unbiased but not R-efficient, that is, the best estimation risk is
achieved if we apply the correct model assumptions.


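Theorem 4.4.6. Let $Y = \Psi^\top\theta^* + \varepsilon$ with $\varepsilon \sim N(0, \Sigma_0)$. Then
\[ \operatorname{Var}[\nabla L(\theta)] = \Psi\Sigma_0^{-1}\Psi^\top. \]
The estimate $\tilde\theta = (\Psi\Psi^\top)^{-1}\Psi Y$ is unbiased, that is, $\mathbb{E}\tilde\theta = \theta^*$, but it is not
R-efficient unless $\Sigma_0 = \Sigma$.


Proof. Let $\tilde\theta_0$ be the MLE for the correct model specification with the noise
$\varepsilon \sim N(0, \Sigma_0)$. As $\tilde\theta$ is unbiased, the difference $\tilde\theta - \tilde\theta_0$ is orthogonal to $\tilde\theta_0$ and it holds
for the variance of $\tilde\theta$

\[ \operatorname{Var}(\tilde\theta) = \operatorname{Var}(\tilde\theta_0) + \operatorname{Var}(\tilde\theta - \tilde\theta_0); \]

cf. the proof of the Gauss-Markov Theorem 4.4.4.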
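
The loss of efficiency can be made visible on a small example. The sketch below is an illustration under assumed inputs (a random design and a diagonal true covariance $\Sigma_0$, both hypothetical): it compares the variance $S\Sigma_0 S^\top$ of the quasi MLE with the variance $(\Psi\Sigma_0^{-1}\Psi^\top)^{-1}$ of the MLE built with the correct covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 2, 6
Psi = rng.normal(size=(p, n))                     # hypothetical p x n design
Sigma0 = np.diag(rng.uniform(0.2, 3.0, size=n))   # assumed true noise covariance

# Quasi MLE built under homogeneous noise: theta_tilde = S Y.
S = np.linalg.solve(Psi @ Psi.T, Psi)
var_quasi = S @ Sigma0 @ S.T                      # W^2 = S Sigma0 S^T

# MLE under the correct covariance: Var = (Psi Sigma0^{-1} Psi^T)^{-1}.
Sigma0_inv = np.linalg.inv(Sigma0)
var_correct = np.linalg.inv(Psi @ Sigma0_inv @ Psi.T)

# The difference of the two variances should be positive semidefinite.
gap_eigs = np.linalg.eigvalsh(var_quasi - var_correct)
print(gap_eigs)
```

The eigenvalues of the difference are nonnegative up to rounding, which is the matrix inequality behind the theorem.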


Exercise 4.4.7. Compare directly the variances of $\tilde\theta$ and of $\tilde\theta_0$.


4.5 Linear Models and Quadratic Log-Likelihood 


Linear Gaussian modeling leads to a specific log-likelihood structure; see Sect. 4.2.
Namely, the log-likelihood function $L(\theta)$ is quadratic in $\theta$, the coefficients of the
quadratic terms are deterministic and the cross term is linear both in $\theta$ and in the
observations $Y_i$. Here we show that this geometric structure of the log-likelihood
characterizes linear models. We say that $L(\theta)$ is quadratic if it is a quadratic
function of $\theta$ and there is a deterministic symmetric matrix $D^2$ such that for any
$\theta^\circ, \theta$

\[ L(\theta) - L(\theta^\circ) = (\theta - \theta^\circ)^\top\nabla L(\theta^\circ) - (\theta - \theta^\circ)^\top D^2(\theta - \theta^\circ)/2. \tag{4.14} \]
Here $\nabla L(\theta) \stackrel{\mathrm{def}}{=} \frac{dL(\theta)}{d\theta}$. As usual we define
\[ \tilde\theta \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} L(\theta), \qquad \theta^* \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} \mathbb{E}L(\theta). \]

The next result describes some properties of the estimate $\tilde\theta$ which are entirely based
on the geometric (quadratic) structure of the function $L(\theta)$. All the results are stated
by using the matrix $D^2$ and the vector $\zeta = \nabla L(\theta^*)$.


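Theorem 4.5.1. Let $L(\theta)$ be quadratic for a matrix $D^2 > 0$. Then for any $\theta^\circ$
\[ \tilde\theta - \theta^\circ = D^{-2}\nabla L(\theta^\circ). \tag{4.15} \]
In particular, with $\theta^\circ = 0$, it holds
\[ \tilde\theta = D^{-2}\nabla L(0). \]
Taking $\theta^\circ = \theta^*$ yields
\[ \tilde\theta - \theta^* = D^{-2}\zeta \tag{4.16} \]
with $\zeta = \nabla L(\theta^*)$. Moreover, $\mathbb{E}\zeta = 0$, and it holds with $V^2 = \operatorname{Var}(\zeta) = \mathbb{E}\zeta\zeta^\top$
\[ \mathbb{E}\tilde\theta = \theta^*, \qquad \operatorname{Var}(\tilde\theta) = D^{-2}V^2D^{-2}. \]
Further, for any $\theta$,
\[ L(\tilde\theta) - L(\theta) = (\tilde\theta - \theta)^\top D^2(\tilde\theta - \theta)/2 = \|D(\tilde\theta - \theta)\|^2/2. \tag{4.17} \]
Finally, it holds for the excess $L(\tilde\theta, \theta^*) \stackrel{\mathrm{def}}{=} L(\tilde\theta) - L(\theta^*)$
\[ 2L(\tilde\theta, \theta^*) = (\tilde\theta - \theta^*)^\top D^2(\tilde\theta - \theta^*) = \zeta^\top D^{-2}\zeta = \|\xi\|^2 \tag{4.18} \]
with $\xi = D^{-1}\zeta$.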


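Proof. The extremal point equation $\nabla L(\theta) = 0$ for the quadratic function $L(\theta)$
from (4.14) yields (4.15). Equation (4.14) with $\theta^\circ = \theta^*$ implies for any $\theta$

\[ \nabla L(\theta) = \nabla L(\theta^\circ) - D^2(\theta - \theta^\circ) = \zeta - D^2(\theta - \theta^*). \tag{4.19} \]
Therefore, it holds for the expectation $\mathbb{E}L(\theta)$
\[ \nabla\mathbb{E}L(\theta) = \mathbb{E}\zeta - D^2(\theta - \theta^*), \]
and the equation $\nabla\mathbb{E}L(\theta^*) = 0$ implies $\mathbb{E}\zeta = 0$.
To show (4.17), apply again the property (4.14) with $\theta^\circ = \tilde\theta$:

\[ L(\theta) - L(\tilde\theta) = (\theta - \tilde\theta)^\top\nabla L(\tilde\theta) - (\theta - \tilde\theta)^\top D^2(\theta - \tilde\theta)/2 = -(\tilde\theta - \theta)^\top D^2(\tilde\theta - \theta)/2. \]
Here we used that $\nabla L(\tilde\theta) = 0$ because $\tilde\theta$ is an extreme point of $L(\theta)$. The last
result (4.18) is a special case with $\theta = \theta^*$ in view of (4.16).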
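
Identity (4.15) says that for a quadratic log-likelihood a single step $\theta^\circ + D^{-2}\nabla L(\theta^\circ)$ from an arbitrary point $\theta^\circ$ already lands at the maximizer. A minimal numerical sketch, assuming the homogeneous Gaussian log-likelihood $L(\theta) = -\|Y - \Psi^\top\theta\|^2/(2\sigma^2)$ and randomly generated inputs (all values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, sigma = 3, 10, 1.0
Psi = rng.normal(size=(p, n))
Y = rng.normal(size=n)                       # any data vector; no model is assumed

D2 = Psi @ Psi.T / sigma**2                  # D^2 = sigma^{-2} Psi Psi^T

def grad_L(theta):
    # Gradient of L(theta) = -||Y - Psi^T theta||^2 / (2 sigma^2).
    return Psi @ (Y - Psi.T @ theta) / sigma**2

theta_tilde = np.linalg.solve(Psi @ Psi.T, Psi @ Y)   # explicit maximizer

# (4.15): theta0 + D^{-2} grad L(theta0) equals the maximizer for any theta0.
theta0 = rng.normal(size=p)
one_step = theta0 + np.linalg.solve(D2, grad_L(theta0))
print(np.abs(one_step - theta_tilde).max())
```

Note that the data vector is arbitrary here: the identity relies only on the quadratic structure, not on the distribution of $Y$.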


This theorem delivers an important message: the main properties of the MLE
$\tilde\theta$ can be explained via the geometric (quadratic) structure of the log-likelihood.
An interesting question to clarify is whether a quadratic log-likelihood structure
is specific for the linear Gaussian model. The answer is positive: there is a one-to-one
correspondence between linear Gaussian models and quadratic log-likelihood
functions. Indeed, the identity (4.19) with $\theta^\circ = \theta^*$ can be rewritten as

\[ \nabla L(\theta) + D^2\theta \equiv \zeta + D^2\theta^*. \]
If we fix any $\theta$ and define $Y = \nabla L(\theta) + D^2\theta$, this yields
\[ Y = D^2\theta^* + \zeta. \]
Similarly, $Y \stackrel{\mathrm{def}}{=} D^{-1}\{\nabla L(\theta) + D^2\theta\}$ yields the equation

\[ Y = D\theta^* + \xi, \tag{4.20} \]

where $\xi = D^{-1}\zeta$. We can summarize as follows.


Theorem 4.5.2. Let $L(\theta)$ be quadratic with a non-degenerate matrix $D^2$. Then
$Y \stackrel{\mathrm{def}}{=} D^{-1}\{\nabla L(\theta) + D^2\theta\}$ does not depend on $\theta$ and $L(\theta) - L(\theta^*)$ is the quasi
log-likelihood ratio for the linear Gaussian model (4.20) with $\xi$ standard normal. It
is the true log-likelihood if and only if $\zeta \sim N(0, D^2)$.


Proof. The model (4.20) with $\xi \sim N(0, I_p)$ leads to the log-likelihood ratio
\[ (\theta - \theta^*)^\top D(Y - D\theta^*) - \|D(\theta - \theta^*)\|^2/2 = (\theta - \theta^*)^\top\zeta - \|D(\theta - \theta^*)\|^2/2 \]
in view of the definition of $Y$. The definition (4.14) implies
\[ L(\theta) - L(\theta^*) = (\theta - \theta^*)^\top\nabla L(\theta^*) - \|D(\theta - \theta^*)\|^2/2. \]

As these two expressions coincide, it follows that $L(\theta)$ is the true log-likelihood if
and only if $\xi = D^{-1}\zeta$ is standard normal.


4.6 Inference Based on the Maximum Likelihood 


All the results presented above for linear models were based on the explicit
representation of the (quasi) MLE $\tilde\theta$. Here we present the approach based on the
analysis of the maximum likelihood. This approach does not require fixing any
analytic expression for the point of maximum of the (quasi) likelihood process
$L(\theta)$. Instead we work directly with the maximum of this process. We establish
exponential inequalities for the excess or the maximum likelihood $L(\tilde\theta, \theta^*)$. We
also show how these results can be used to study the accuracy of the MLE $\tilde\theta$, in
particular, for building confidence sets.

One more benefit of the ML-based approach is that it equally applies to a
homogeneous and to a heterogeneous noise provided that the noise structure is
not misspecified. The celebrated chi-squared result about the maximum likelihood
$L(\tilde\theta, \theta^*)$ claims that the distribution of $2L(\tilde\theta, \theta^*)$ is chi-squared with $p$ degrees of
freedom $\chi^2_p$ and it does not depend on the noise covariance; see Sect. 4.6.

Now we specify the setup. The starting point of the ML-approach is the linear
Gaussian model assumption $Y = \Psi^\top\theta^* + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$. The corresponding
log-likelihood ratio $L(\theta)$ can be written as

\[ L(\theta) = -\frac{1}{2}(Y - \Psi^\top\theta)^\top\Sigma^{-1}(Y - \Psi^\top\theta) + R, \tag{4.21} \]

where the remainder term $R$ does not depend on $\theta$. Now one can see that $L(\theta)$ is a
quadratic function of $\theta$. Moreover, $\nabla^2 L(\theta) = -\Psi\Sigma^{-1}\Psi^\top$, so that $L(\theta)$ is quadratic
with $D^2 = \Psi\Sigma^{-1}\Psi^\top$. This enables us to apply the general results of Sect. 4.5 which
are only based on the geometric (quadratic) structure of the log-likelihood $L(\theta)$: the
true data distribution can be arbitrary.


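Theorem 4.6.1. Consider $L(\theta)$ from (4.21). For any $\theta$, it holds with $D^2 = \Psi\Sigma^{-1}\Psi^\top$
\[ L(\tilde\theta, \theta) = (\tilde\theta - \theta)^\top D^2(\tilde\theta - \theta)/2. \tag{4.22} \]

In particular, if $\Sigma = \sigma^2 I_n$ then the fitted log-likelihood is proportional to the
quadratic loss $\|\tilde f - f_\theta\|^2$ for $\tilde f = \Psi^\top\tilde\theta$ and $f_\theta = \Psi^\top\theta$:

\[ L(\tilde\theta, \theta) = \frac{1}{2\sigma^2}\bigl\|\Psi^\top(\tilde\theta - \theta)\bigr\|^2 = \frac{1}{2\sigma^2}\|\tilde f - f_\theta\|^2. \]

If $\theta^* \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_\theta \mathbb{E}L(\theta) = D^{-2}\Psi\Sigma^{-1}f^*$ for $f^* = \mathbb{E}Y$, then

\[ 2L(\tilde\theta, \theta^*) = \zeta^\top D^{-2}\zeta = \|\xi\|^2 \tag{4.23} \]

with $\zeta = \nabla L(\theta^*)$ and $\xi \stackrel{\mathrm{def}}{=} D^{-1}\zeta$.


Proof. The results (4.22) and (4.23) follow from Theorem 4.5.1; see (4.17)
and (4.18).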


If the model assumptions are not misspecified, one can establish the remarkable
$\chi^2$ result.

Theorem 4.6.2. Let $L(\theta)$ from (4.21) be the log-likelihood for the model
$Y = \Psi^\top\theta^* + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$. Then $\xi = D^{-1}\zeta \sim N(0, I_p)$ and $2L(\tilde\theta, \theta^*) \sim \chi^2_p$ is
chi-squared with $p$ degrees of freedom.


Proof. By direct calculus
\[ \zeta = \nabla L(\theta^*) = \Psi\Sigma^{-1}(Y - \Psi^\top\theta^*) = \Psi\Sigma^{-1}\varepsilon. \]

So, $\zeta$ is a linear transformation of a Gaussian vector $Y$ and thus it is Gaussian as
well. By Theorem 4.5.1, $\mathbb{E}\zeta = 0$. Moreover, $\operatorname{Var}(\varepsilon) = \Sigma$ implies

\[ \operatorname{Var}(\zeta) = \mathbb{E}\Psi\Sigma^{-1}\varepsilon\varepsilon^\top\Sigma^{-1}\Psi^\top = \Psi\Sigma^{-1}\Psi^\top = D^2 \]

yielding that $\xi = D^{-1}\zeta$ is standard normal.


The last result $2L(\tilde\theta, \theta^*) \sim \chi^2_p$ is sometimes called the “chi-squared
phenomenon”: the distribution of the maximum likelihood only depends on the number
of parameters to be estimated and is independent of the design $\Psi$, of the noise
covariance matrix $\Sigma$, etc. This particularly explains the use of the word “phenomenon”
in the name of the result.
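
The phenomenon is easy to observe by simulation. The sketch below is only an illustration: the design and the (dependent, heterogeneous) covariance $\Sigma$ are generated at random, and the number of replications is an arbitrary choice. It draws the noise from $N(0, \Sigma)$, evaluates $2L(\tilde\theta, \theta^*) = \zeta^\top D^{-2}\zeta$ and compares its empirical mean and variance with the values $p$ and $2p$ of the $\chi^2_p$ distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 3, 12
Psi = rng.normal(size=(p, n))
M = rng.normal(size=(n, n))
Sigma = M @ M.T + n * np.eye(n)              # a dependent, heterogeneous covariance

Sigma_inv = np.linalg.inv(Sigma)
D2 = Psi @ Sigma_inv @ Psi.T                 # D^2 = Psi Sigma^{-1} Psi^T

# Simulate 2 L(theta_tilde, theta*) = zeta^T D^{-2} zeta, zeta = Psi Sigma^{-1} eps.
n_sim = 20000
eps = rng.multivariate_normal(np.zeros(n), Sigma, size=n_sim)   # rows are noise draws
zeta = eps @ Sigma_inv @ Psi.T                                   # rows are zeta^T
two_L = np.einsum("ij,ij->i", zeta, np.linalg.solve(D2, zeta.T).T)

# A chi-squared variable with p degrees of freedom has mean p and variance 2p.
print(two_L.mean(), two_L.var())
```

Changing `Psi` or `Sigma` changes the individual draws but, up to Monte Carlo error, not these two moments.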


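Exercise 4.6.1. Check that the linear transformation $\breve Y = \Sigma^{-1/2}Y$ of the data
does not change the value of the log-likelihood ratio $L(\theta, \theta^*)$ and hence, of the
maximum likelihood $L(\tilde\theta, \theta^*)$.

Hint: use the representation

\[ L(\theta) = -\frac{1}{2}(Y - \Psi^\top\theta)^\top\Sigma^{-1}(Y - \Psi^\top\theta) + R = -\frac{1}{2}(\breve Y - \breve\Psi^\top\theta)^\top(\breve Y - \breve\Psi^\top\theta) + R \]

and check that the transformed data $\breve Y$ is described by the model $\breve Y = \breve\Psi^\top\theta^* + \breve\varepsilon$
with $\breve\Psi = \Psi\Sigma^{-1/2}$ and $\breve\varepsilon = \Sigma^{-1/2}\varepsilon \sim N(0, I_n)$ yielding the same log-likelihood
ratio as in the original model.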


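Exercise 4.6.2. Assume homogeneous noise in (4.21) with $\Sigma = \sigma^2 I_n$. Then it
holds

\[ 2L(\tilde\theta, \theta^*) = \sigma^{-2}\|\Pi\varepsilon\|^2 \]

where $\Pi = \Psi^\top(\Psi\Psi^\top)^{-1}\Psi$ is the projector in $\mathbb{R}^n$ on the subspace spanned by the
vectors $\psi_1, \ldots, \psi_p$.
Hint: use that $\zeta = \sigma^{-2}\Psi\varepsilon$, $D^2 = \sigma^{-2}\Psi\Psi^\top$, and

\[ \sigma^{-2}\|\Pi\varepsilon\|^2 = \sigma^{-2}\varepsilon^\top\Pi^\top\Pi\varepsilon = \sigma^{-2}\varepsilon^\top\Pi\varepsilon = \zeta^\top D^{-2}\zeta. \]

We write the result of Theorem 4.6.2 in the form $2L(\tilde\theta, \theta^*) \sim \chi^2_p$ where $\chi^2_p$
stands for the chi-squared distribution with $p$ degrees of freedom. This result can
be used to build likelihood-based confidence ellipsoids for the parameter $\theta^*$. Given
$\mathfrak{z} > 0$, define

\[ \mathcal{E}(\mathfrak{z}) = \bigl\{\theta : L(\tilde\theta, \theta) \le \mathfrak{z}\bigr\} = \bigl\{\theta : \sup_{\theta'} L(\theta') - L(\theta) \le \mathfrak{z}\bigr\}. \tag{4.24} \]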


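Theorem 4.6.3. Assume $Y = \Psi^\top\theta^* + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$ and consider the MLE
$\tilde\theta$. Define $\mathfrak{z}_\alpha$ by $\mathbb{P}(\chi^2_p > 2\mathfrak{z}_\alpha) = \alpha$. Then $\mathcal{E}(\mathfrak{z}_\alpha)$ from (4.24) is an $\alpha$-confidence set
for $\theta^*$.


Exercise 4.6.3. Let $D^2 = \Psi\Sigma^{-1}\Psi^\top$. Check that the likelihood-based CS $\mathcal{E}(\mathfrak{z}_\alpha)$
and the estimate-based CS $E(z_\alpha) = \{\theta : \|D(\tilde\theta - \theta)\| \le z_\alpha\}$, $z_\alpha^2 = 2\mathfrak{z}_\alpha$, coincide in the
case of the linear modeling:

\[ \mathcal{E}(\mathfrak{z}_\alpha) = \bigl\{\theta : \|D(\tilde\theta - \theta)\|^2 \le 2\mathfrak{z}_\alpha\bigr\}. \]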


Another corollary of the chi-squared result is a concentration bound for the
maximum likelihood. A similar result was stated for the univariate exponential
family model: the value $L(\tilde\theta, \theta^*)$ is stochastically bounded with exponential
moments, and the bound does not depend on the particular family, parameter
value, sample size, etc. Now we can extend this result to the case of a linear
Gaussian model. Indeed, Theorem 4.6.2 states that the distribution of $2L(\tilde\theta, \theta^*)$
is chi-squared and only depends on the number of parameters to be estimated. The
latter distribution concentrates on the ball of radius of order $p^{1/2}$ and the deviation
probability is exponentially small.


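Theorem 4.6.4. Assume $Y = \Psi^\top\theta^* + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$. Then for every $x > 0$,
it holds with $\kappa \ge 6.6$

\[ \mathbb{P}\bigl(2L(\tilde\theta, \theta^*) > p + \sqrt{\kappa x p} \vee (\kappa x)\bigr) = \mathbb{P}\bigl(\|D(\tilde\theta - \theta^*)\|^2 > p + \sqrt{\kappa x p} \vee (\kappa x)\bigr) \le \exp(-x). \tag{4.25} \]


Proof. Define $\xi \stackrel{\mathrm{def}}{=} D(\tilde\theta - \theta^*)$. By Theorem 4.4.4 $\xi$ is a standard normal vector in $\mathbb{R}^p$
and by Theorem 4.6.1 $2L(\tilde\theta, \theta^*) = \|\xi\|^2$. Now the statement (4.25) follows from
the general deviation bound for the Gaussian quadratic forms; see Theorem A.2.1.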

The main message of this result can be explained as follows: the deviation
probability that the estimate $\tilde\theta$ does not belong to the elliptic set
$\{\theta : \|D(\theta - \theta^*)\| \le z\}$ starts to vanish when $z^2$ exceeds the dimensionality $p$ of the
parameter space. Similarly, the coverage probability that the true parameter $\theta^*$ is
not covered by the confidence set $\mathcal{E}(\mathfrak{z})$ starts to vanish when $2\mathfrak{z}$ exceeds $p$.

Corollary 4.6.1. Assume $Y = \Psi^\top\theta^* + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$. Then for every $x > 0$,
it holds with $2\mathfrak{z} = p + \sqrt{\kappa x p} \vee (\kappa x)$ for $\kappa \ge 6.6$

\[ \mathbb{P}\bigl(\mathcal{E}(\mathfrak{z}) \not\ni \theta^*\bigr) \le \exp(-x). \]


Exercise 4.6.4. Compute $\mathfrak{z}$ ensuring the covering of 95 % in the dimension
$p = 1, 2, 10, 20$.


4.6.1 A Misspecified LPA 


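Now we discuss the behavior of the fitted log-likelihood for the misspecified linear
parametric assumption $\mathbb{E}Y = \Psi^\top\theta^*$. Let the response function $f^*$ not be linearly
expandable as $f^* = \Psi^\top\theta^*$. Following Theorem 4.4.3, define $\theta^* = Sf^*$
with $S = (\Psi\Sigma^{-1}\Psi^\top)^{-1}\Psi\Sigma^{-1}$. This point provides the best approximation of the
nonlinear response $f^*$ by a linear parametric fit $\Psi^\top\theta$.


Theorem 4.6.5. Assume $Y = f^* + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$. Let $\theta^* = Sf^*$. Then $\tilde\theta$
is an R-efficient estimate of $\theta^*$ and

\[ 2L(\tilde\theta, \theta^*) = \zeta^\top D^{-2}\zeta = \|\xi\|^2 \sim \chi^2_p, \]

where $D^2 = \Psi\Sigma^{-1}\Psi^\top$, $\zeta = \nabla L(\theta^*) = \Psi\Sigma^{-1}\varepsilon$, $\xi = D^{-1}\zeta$ is a standard normal
vector in $\mathbb{R}^p$ and $\chi^2_p$ is a chi-squared random variable with $p$ degrees of freedom.
In particular, $\mathcal{E}(\mathfrak{z}_\alpha)$ is an $\alpha$-CS for the vector $\theta^*$ and the bound of Corollary 4.6.1
applies.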


Exercise 4.6.5. Prove the result of Theorem 4.6.5. 


4.6.2 A Misspecified Noise Structure 


This section addresses the question about the features of the maximum likelihood 
in the case when the likelihood is built under a wrong assumption about the noise 
structure. As one can expect, the chi-squared result is not valid anymore in this 
situation and the distribution of the maximum likelihood depends on the true 
noise covariance. However, the nice geometric structure of the maximum likelihood 
manifested by Theorems 4.6.1 and 4.6.3 does not rely on the true data distribution 
and it is only based on our structural assumptions on the considered model. This 
helps to get rigorous results about the behaviors of the maximum likelihood and 
particularly about its concentration properties. 


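Theorem 4.6.6. Let $\tilde\theta$ be built for the model $Y = \Psi^\top\theta^* + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$,
while the true noise covariance is $\Sigma_0$: $\mathbb{E}\varepsilon = 0$ and $\operatorname{Var}(\varepsilon) = \Sigma_0$. Then
\[ \mathbb{E}\tilde\theta = \theta^*, \qquad \operatorname{Var}(\tilde\theta) = D^{-2}W^2D^{-2}, \]
where
\[ D^2 = \Psi\Sigma^{-1}\Psi^\top, \qquad W^2 = \Psi\Sigma^{-1}\Sigma_0\Sigma^{-1}\Psi^\top. \]
Further,
\[ 2L(\tilde\theta, \theta^*) = \|D(\tilde\theta - \theta^*)\|^2 = \|\xi\|^2, \tag{4.26} \]
where $\xi$ is a random vector in $\mathbb{R}^p$ with $\mathbb{E}\xi = 0$ and
\[ \operatorname{Var}(\xi) = B \stackrel{\mathrm{def}}{=} D^{-1}W^2D^{-1}. \]
Moreover, if $\varepsilon \sim N(0, \Sigma_0)$, then $\tilde\theta \sim N(\theta^*, D^{-2}W^2D^{-2})$ and $\xi \sim N(0, B)$.


Proof. The moments of $\tilde\theta$ have been computed in Theorem 4.5.1 while the equality
$2L(\tilde\theta, \theta^*) = \|D(\tilde\theta - \theta^*)\|^2 = \|\xi\|^2$ is given in Theorem 4.6.1. Next,
$\zeta = \nabla L(\theta^*) = \Psi\Sigma^{-1}\varepsilon$ and
\[ W^2 \stackrel{\mathrm{def}}{=} \operatorname{Var}(\zeta) = \Psi\Sigma^{-1}\operatorname{Var}(\varepsilon)\Sigma^{-1}\Psi^\top = \Psi\Sigma^{-1}\Sigma_0\Sigma^{-1}\Psi^\top. \]
This implies that
\[ \operatorname{Var}(\xi) = \mathbb{E}\xi\xi^\top = D^{-1}\operatorname{Var}(\zeta)D^{-1} = D^{-1}W^2D^{-1}. \]
It remains to note that if $\varepsilon$ is a Gaussian vector, then $\zeta = \Psi\Sigma^{-1}\varepsilon$, $\xi = D^{-1}\zeta$, and
$\tilde\theta - \theta^* = D^{-2}\zeta$ are Gaussian as well.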

Exercise 4.6.6. Check that $\Sigma_0 = \Sigma$ leads back to the $\chi^2$-result.


One can see that the chi-squared result is not valid any more if the noise structure
is misspecified. An interesting question is whether the CS $\mathcal{E}(\mathfrak{z})$ can be applied
in the case of a misspecified noise under some proper adjustment of the value $\mathfrak{z}$.
Surprisingly, the answer is not entirely negative. The reason is that the vector $\xi$
from (4.26) is zero mean and its norm has a similar behavior as in the case of the
correct noise specification: the probability $\mathbb{P}(\|\xi\| > z)$ starts to degenerate when
$z^2$ exceeds $\mathbb{E}\|\xi\|^2$. A general bound from Theorem A.2.2 in Sect. A.1 implies the
following bound for the coverage probability.


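Corollary 4.6.2. Under the conditions of Theorem 4.6.6, for every $x > 0$, it holds
with $\mathtt{p} = \operatorname{tr}(B)$, $\mathtt{v}^2 = 2\operatorname{tr}(B^2)$, and $\mathtt{a}^* = \|B\|_\infty$
\[ \mathbb{P}\bigl(2L(\tilde\theta, \theta^*) > \mathtt{p} + (2\mathtt{v}x^{1/2}) \vee (6\mathtt{a}^*x)\bigr) \le \exp(-x). \]


Exercise 4.6.7. Show that an overestimation of the noise in the sense $\Sigma \ge \Sigma_0$
preserves the coverage probability for the CS $\mathcal{E}(\mathfrak{z}_\alpha)$, that is, if $2\mathfrak{z}_\alpha$ is the $1 - \alpha$
quantile of $\chi^2_p$, then $\mathbb{P}\bigl(\mathcal{E}(\mathfrak{z}_\alpha) \not\ni \theta^*\bigr) \le \alpha$.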
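
The constants of Corollary 4.6.2 are directly computable once $\Sigma$ and $\Sigma_0$ are given. The following sketch is an illustration under assumptions: the working covariance is $\Sigma = I_n$, the true covariance $\Sigma_0$ is a randomly generated diagonal matrix, and $\|B\|_\infty$ is taken to be the largest eigenvalue of $B$ (the operator norm).

```python
import numpy as np

def sym_inv_sqrt(A):
    # Inverse symmetric square root of a symmetric positive definite matrix.
    vals, vecs = np.linalg.eigh(A)
    return (vecs / np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(4)
p, n = 3, 10
Psi = rng.normal(size=(p, n))
Sigma = np.eye(n)                                   # assumed (working) noise covariance
Sigma0 = np.diag(rng.uniform(0.5, 2.0, size=n))     # hypothetical true noise covariance

Sigma_inv = np.linalg.inv(Sigma)
D2 = Psi @ Sigma_inv @ Psi.T
W2 = Psi @ Sigma_inv @ Sigma0 @ Sigma_inv @ Psi.T
D_inv = sym_inv_sqrt(D2)
B = D_inv @ W2 @ D_inv                              # Var(xi) = D^{-1} W^2 D^{-1}

p_B = np.trace(B)                                   # plays the role of the dimension p
v_B = np.sqrt(2.0 * np.trace(B @ B))
a_B = np.linalg.eigvalsh(B).max()                   # operator norm of B (assumption)

def threshold(x):
    # Deviation threshold of Corollary 4.6.2 for the tail probability exp(-x).
    return p_B + max(2.0 * v_B * np.sqrt(x), 6.0 * a_B * x)

print(p_B, v_B, a_B, threshold(3.0))
```

With $\Sigma_0 = \Sigma$ the matrix $B$ reduces to $I_p$ and $\operatorname{tr}(B) = p$, so the effective dimension $\operatorname{tr}(B)$ measures how far the true noise is from the assumed one.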


4.7 Ridge Regression, Projection, and Shrinkage 


This section discusses the important situation when the number of predictors $\psi_j$
and hence the number of parameters $p$ in the linear model $Y = \Psi^\top\theta^* + \varepsilon$ is not
small relative to the sample size. Then the least squares or the maximum likelihood
approach meets serious problems. The first one relates to the numerical issues. The
definition of the LSE $\tilde\theta$ involves the inversion of the $p \times p$ matrix $\Psi\Psi^\top$ and such
an inversion becomes a delicate task for $p$ large. The other problem concerns the
inference for the estimated parameter $\theta^*$. The risk bound and the width of the
confidence set are proportional to the parameter dimension $p$ and thus, with large $p$,
the inference statements become almost uninformative. In particular, if $p$ is of order
the sample size $n$, even consistency is not achievable. One faces a really critical
situation. We already know that the MLE is the efficient estimate in the class of
all unbiased estimates. At the same time it is highly inefficient in overparametrized
models. The only way out of this situation is to sacrifice the unbiasedness property
in favor of reducing the model complexity: some procedures can be more efficient
than MLE even if they are biased. This section discusses one way of resolving these
problems by regularization or shrinkage. To be more specific, for the rest of the
section we consider the following setup. The observed vector $Y$ follows the model


\[ Y = f^* + \varepsilon \tag{4.27} \]

with a homogeneous error vector $\varepsilon$: $\mathbb{E}\varepsilon = 0$, $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$. Noise misspecification
is not considered in this section.

Furthermore, we assume a basis or a collection of basis vectors $\psi_1, \ldots, \psi_p$ is
given with $p$ large. This allows for approximating the response vector $f^* = \mathbb{E}Y$ in
the form $f^* = \Psi^\top\theta^*$, or, equivalently,

\[ f^* = \theta^*_1\psi_1 + \ldots + \theta^*_p\psi_p. \]

In many cases we will assume that the basis is already orthogonalized: $\Psi\Psi^\top = I_p$.
The model (4.27) can be rewritten as

\[ Y = \Psi^\top\theta^* + \varepsilon, \qquad \operatorname{Var}(\varepsilon) = \sigma^2 I_n. \]

The MLE or oLSE of the parameter vector $\theta^*$ for this model reads as
\[ \tilde\theta = (\Psi\Psi^\top)^{-1}\Psi Y, \qquad \tilde f = \Psi^\top\tilde\theta = \Psi^\top(\Psi\Psi^\top)^{-1}\Psi Y. \]

If the matrix $\Psi\Psi^\top$ is degenerate or badly posed, computing the MLE $\tilde\theta$ is a hard
task. Below we discuss how this problem can be treated.


4.7.1 Regularization and Ridge Regression 


Let $R$ be a positive definite symmetric $p \times p$ matrix. Then the sum $\Psi\Psi^\top + R$ is positive
definite symmetric as well and can be inverted whatever the matrix $\Psi$ is. This suggests to
replace $(\Psi\Psi^\top)^{-1}$ by $(\Psi\Psi^\top + R)^{-1}$ leading to the regularized least squares estimate
$\tilde\theta_R$ of the parameter vector $\theta$ and the corresponding response estimate $\tilde f_R$:

\[ \tilde\theta_R \stackrel{\mathrm{def}}{=} (\Psi\Psi^\top + R)^{-1}\Psi Y, \qquad \tilde f_R \stackrel{\mathrm{def}}{=} \Psi^\top(\Psi\Psi^\top + R)^{-1}\Psi Y. \tag{4.28} \]
Such a method is also called ridge regression. An example of choosing $R$ is the
multiple of the unit matrix: $R = \alpha I_p$ where $\alpha > 0$ and $I_p$ stands for the unit matrix.
This method is also called Tikhonov regularization and it results in the parameter
estimate $\tilde\theta_\alpha$ and the response estimate $\tilde f_\alpha$:
\[ \tilde\theta_\alpha \stackrel{\mathrm{def}}{=} (\Psi\Psi^\top + \alpha I_p)^{-1}\Psi Y, \qquad \tilde f_\alpha \stackrel{\mathrm{def}}{=} \Psi^\top(\Psi\Psi^\top + \alpha I_p)^{-1}\Psi Y. \tag{4.29} \]

A proper choice of the matrix $R$ for the ridge regression method (4.28) or the
parameter $\alpha$ for the Tikhonov regularization (4.29) is an important issue. Below we
discuss several approaches which lead to the estimate (4.28) with a specific choice of
the matrix $R$. The properties of the estimates $\tilde\theta_R$ and $\tilde f_R$ will be studied in context
of penalized likelihood estimation in the next section.
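
The estimate (4.28) can be computed with a linear solve instead of an explicit matrix inverse. A minimal sketch follows; the helper name `ridge_estimate`, the random data, and the value of $\alpha$ are our own illustrative choices. The design deliberately contains a duplicated row, so that $\Psi\Psi^\top$ itself is singular.

```python
import numpy as np

def ridge_estimate(Psi, Y, R):
    # theta_R = (Psi Psi^T + R)^{-1} Psi Y, computed by a linear solve
    # rather than by forming the inverse matrix.
    return np.linalg.solve(Psi @ Psi.T + R, Psi @ Y)

rng = np.random.default_rng(5)
p, n = 4, 6
Psi = rng.normal(size=(p, n))
Psi[3] = Psi[2]                       # duplicated row: Psi Psi^T is singular
Y = rng.normal(size=n)

alpha = 0.1
theta_alpha = ridge_estimate(Psi, Y, alpha * np.eye(p))   # Tikhonov choice R = alpha I_p
f_alpha = Psi.T @ theta_alpha                              # response estimate

# Smallest eigenvalue of Psi Psi^T (numerically zero) and the ridge solution.
print(np.linalg.eigvalsh(Psi @ Psi.T).min(), theta_alpha)
```

Although the unregularized problem has no unique solution here, the ridge estimate is well defined and assigns equal coefficients to the two identical predictors.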


4.7.2 Penalized Likelihood: Bias and Variance


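The estimate (4.28) can be obtained in a natural way within the (quasi) ML approach
using the penalized least squares. The classical unpenalized method is based on
minimizing the sum of residuals squared:

\[ \tilde\theta = \operatorname*{argmax}_{\theta} L(\theta) = \operatorname*{arginf}_{\theta} \|Y - \Psi^\top\theta\|^2 \]

with $L(\theta) = -\sigma^{-2}\|Y - \Psi^\top\theta\|^2/2$. (Here we omit the terms which do not depend
on $\theta$.) Now we introduce an additional penalty on the objective function which
penalizes for the complexity of the candidate vector $\theta$ which is expressed by the
value $\|G\theta\|^2/2$ for a given symmetric matrix $G$. This choice of complexity measure
implicitly assumes that the vector $\theta \equiv 0$ has the smallest complexity equal to
zero and this complexity increases with the norm of $G\theta$. Define the penalized
log-likelihood

\[ L_G(\theta) \stackrel{\mathrm{def}}{=} L(\theta) - \|G\theta\|^2/2 = -(2\sigma^2)^{-1}\|Y - \Psi^\top\theta\|^2 - \|G\theta\|^2/2 - (n/2)\log(2\pi\sigma^2). \tag{4.30} \]
The penalized MLE reads as

\[ \tilde\theta_G = \operatorname*{argmax}_{\theta} L_G(\theta) = \operatorname*{argmin}_{\theta}\bigl\{(2\sigma^2)^{-1}\|Y - \Psi^\top\theta\|^2 + \|G\theta\|^2/2\bigr\}. \]

A straightforward calculus leads to the expression (4.28) for $\tilde\theta_G$ with $R = \sigma^2G^2$:
\[ \tilde\theta_G \stackrel{\mathrm{def}}{=} (\Psi\Psi^\top + \sigma^2G^2)^{-1}\Psi Y. \tag{4.31} \]

We see that $\tilde\theta_G$ is again a linear estimate: $\tilde\theta_G = S_GY$ with
$S_G \stackrel{\mathrm{def}}{=} (\Psi\Psi^\top + \sigma^2G^2)^{-1}\Psi$. The results of Sect. 4.4 explain that $\tilde\theta_G$ in fact estimates the value $\theta_G$
defined by

\[ \theta_G \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} \mathbb{E}L_G(\theta) = \operatorname*{arginf}_{\theta} \mathbb{E}\bigl\{\|Y - \Psi^\top\theta\|^2 + \sigma^2\|G\theta\|^2\bigr\} = (\Psi\Psi^\top + \sigma^2G^2)^{-1}\Psi f^* = S_Gf^*. \tag{4.32} \]
In particular, if $f^* = \Psi^\top\theta^*$, then

\[ \theta_G = (\Psi\Psi^\top + \sigma^2G^2)^{-1}\Psi\Psi^\top\theta^* \tag{4.33} \]

and $\theta_G \ne \theta^*$ unless $G = 0$. In other words, the penalized MLE $\tilde\theta_G$ is biased.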


Exercise 4.7.1. Check that $\mathbb{E}\tilde\theta_\alpha = \theta_\alpha$ for $\theta_\alpha = (\Psi\Psi^\top + \alpha I_p)^{-1}\Psi\Psi^\top\theta^*$, and that the
bias $\|\theta_\alpha - \theta^*\|$ grows with the regularization parameter $\alpha$.


The penalized MLE $\tilde\theta_G$ leads to the response estimate $\tilde f_G = \Psi^\top\tilde\theta_G$.

Exercise 4.7.2. Check that the penalized ML approach leads to the response
estimate

\[ \tilde f_G = \Psi^\top\tilde\theta_G = \Psi^\top(\Psi\Psi^\top + \sigma^2G^2)^{-1}\Psi Y = \Pi_GY \]

with $\Pi_G = \Psi^\top(\Psi\Psi^\top + \sigma^2G^2)^{-1}\Psi$. Show that $\Pi_G$ is a sub-projector in the sense
that $\|\Pi_Gu\| \le \|u\|$ for any $u \in \mathbb{R}^n$.

Exercise 4.7.3. Let $\Psi$ be orthonormal: $\Psi\Psi^\top = I_p$. Then the penalized MLE $\tilde\theta_G$
can be represented as

\[ \tilde\theta_G = (I_p + \sigma^2G^2)^{-1}Z, \]

where $Z = \Psi Y$ is the vector of empirical Fourier coefficients. Specify the result for
the case of a diagonal matrix $G = \operatorname{diag}(g_1, \ldots, g_p)$ and describe the corresponding
response estimate $\tilde f_G$.


The previous results indicate that introducing the penalization leads to some bias of 
estimation. One can ask about a benefit of using a penalized procedure. The next 
result shows that penalization decreases the variance of estimation and thus makes 
the procedure more stable. 


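Theorem 4.7.1. Let $\tilde\theta_G$ be a penalized MLE from (4.31). Then $\mathbb{E}\tilde\theta_G = \theta_G$,
see (4.33), and under noise homogeneity $\operatorname{Var}(\varepsilon) = \sigma^2I_n$, it holds

\[ \operatorname{Var}(\tilde\theta_G) = (\sigma^{-2}\Psi\Psi^\top + G^2)^{-1}\sigma^{-2}\Psi\Psi^\top(\sigma^{-2}\Psi\Psi^\top + G^2)^{-1} = D_G^{-2}D^2D_G^{-2} \]

with $D_G^2 = \sigma^{-2}\Psi\Psi^\top + G^2$. In particular, $\operatorname{Var}(\tilde\theta_G) \le D_G^{-2}$. If $\varepsilon \sim N(0, \sigma^2I_n)$,
then $\tilde\theta_G$ is also normal: $\tilde\theta_G \sim N(\theta_G, D_G^{-2}D^2D_G^{-2})$.

Moreover, the bias $\|\theta_G - \theta^*\|$ monotonically increases in $G^2$ while the variance
monotonically decreases with the penalization $G$.


Proof. The first two moments of $\tilde\theta_G$ are computed from $\tilde\theta_G = S_GY$. Monotonicity
of the bias and variance of $\tilde\theta_G$ is proved below in Exercise 4.7.6.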


Exercise 4.7.4. Let $\Psi$ be orthonormal: $\Psi\Psi^\top = I_p$. Describe $\operatorname{Var}(\tilde\theta_G)$. Show that
the variance decreases with the penalization $G$ in the sense that $G_1 \ge G$ implies
$\operatorname{Var}(\tilde\theta_{G_1}) \le \operatorname{Var}(\tilde\theta_G)$.


Exercise 4.7.5. Let $\Psi\Psi^\top = I_p$ and let $G = \operatorname{diag}(g_1, \ldots, g_p)$ be a diagonal
matrix. Compute the squared bias $\|\theta_G - \theta^*\|^2$ and show that it monotonically
increases in each $g_j$ for $j = 1, \ldots, p$.


Exercise 4.7.6. Let $G$ be a symmetric matrix and $\tilde\theta_G$ the corresponding penalized
MLE. Show that the variance $\operatorname{Var}(\tilde\theta_G)$ decreases while the bias $\|\theta_G - \theta^*\|$ increases
in $G^2$.
Hint: with $D^2 = \sigma^{-2}\Psi\Psi^\top$, show that for any vector $w \in \mathbb{R}^p$ and $u = D^{-1}w$, it
holds

\[ w^\top\operatorname{Var}(\tilde\theta_G)w = u^\top(I_p + D^{-1}G^2D^{-1})^{-2}u \]

and this value decreases with $G^2$ because $I_p + D^{-1}G^2D^{-1}$ increases. Show in a
similar way that
\[ \|\theta_G - \theta^*\|^2 = \|(D^2 + G^2)^{-1}G^2\theta^*\|^2 = \theta^{*\top}\Gamma^{-1}\theta^* \]

with $\Gamma = (I_p + G^{-2}D^2)(I_p + D^2G^{-2})$. Show that the matrix $\Gamma$ monotonically
increases and thus $\Gamma^{-1}$ monotonically decreases as a function of the symmetric
matrix $B = G^{-2}$.


Putting together the results about the bias and the variance of 6G yields the 
statement about the quadratic risk. 


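Theorem 4.7.2. Assume the model $Y = \Psi^\top\theta^* + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \sigma^2I_n$. Then the
estimate $\tilde\theta_G$ fulfills

\[ \mathbb{E}\|\tilde\theta_G - \theta^*\|^2 = \|\theta_G - \theta^*\|^2 + \operatorname{tr}\bigl(D_G^{-2}D^2D_G^{-2}\bigr). \]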


This result is called the bias-variance decomposition. The choice of a proper 
regularization is usually based on this decomposition: one selects a regularization 
from a given class to provide the minimal possible risk. This approach is referred to 
as bias-variance trade-off. 
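
The decomposition of Theorem 4.7.2 can be verified against a Monte Carlo estimate of the risk. The sketch below is an illustration only: the design, the true parameter `theta_star`, the diagonal penalty, and the number of replications are hypothetical choices, and the noise is taken Gaussian for convenience.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, sigma = 3, 15, 0.7
Psi = rng.normal(size=(p, n))
theta_star = np.array([1.0, -2.0, 0.5])          # hypothetical true parameter
G2 = np.diag([0.0, 1.0, 4.0])                    # G^2 for a diagonal penalty G

D2 = Psi @ Psi.T / sigma**2                      # D^2 = sigma^{-2} Psi Psi^T
DG2_inv = np.linalg.inv(D2 + G2)                 # D_G^{-2}

theta_G = DG2_inv @ D2 @ theta_star              # (4.33) written via D^2 and D_G^2
sq_bias = np.sum((theta_G - theta_star) ** 2)
variance = np.trace(DG2_inv @ D2 @ DG2_inv)
risk_formula = sq_bias + variance

# Monte Carlo estimate of E||theta_tilde_G - theta*||^2 for comparison.
n_sim = 20000
Y = theta_star @ Psi + sigma * rng.normal(size=(n_sim, n))   # rows are samples of Y
theta_hat = Y @ Psi.T @ DG2_inv / sigma**2                   # rows are theta_tilde_G^T
risk_mc = np.mean(np.sum((theta_hat - theta_star) ** 2, axis=1))
print(risk_formula, risk_mc)
```

The two numbers agree up to Monte Carlo error. Repeating the computation for several penalties $G$ shows the trade-off: the squared bias grows and the trace term shrinks as $G^2$ increases.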


4.7.3 Inference for the Penalized MLE 


Here we discuss some properties of the penalized MLE $\tilde\theta_G$. In particular, we focus
on the construction of confidence and concentration sets based on the penalized
log-likelihood. We know that the regularized estimate $\tilde\theta_G$ is the empirical counterpart
of the value $\theta_G$ which solves the regularized deterministic problem (4.32). We also
know that the key results are expressed via the value of the supremum
$\sup_\theta L_G(\theta) - L_G(\theta_G)$. The next result extends Theorem 4.6.1 to the penalized likelihood.


Theorem 4.7.3. Let $L_G(\theta)$ be the penalized log-likelihood from (4.30). Then

\[ 2L_G(\tilde\theta_G, \theta_G) = (\tilde\theta_G - \theta_G)^\top D_G^2(\tilde\theta_G - \theta_G) \tag{4.34} \]
\[ \phantom{2L_G(\tilde\theta_G, \theta_G)} = \sigma^{-2}\varepsilon^\top\Pi_G\varepsilon \tag{4.35} \]

with $\Pi_G = \Psi^\top(\Psi\Psi^\top + \sigma^2G^2)^{-1}\Psi$.
In general the matrix $\Pi_G$ is not a projector and hence, $\sigma^{-2}\varepsilon^\top\Pi_G\varepsilon$ is not
$\chi^2$-distributed, the chi-squared result does not apply.


Exercise 4.7.7. Prove (4.34).

Hint: apply the Taylor expansion to $L_G(\theta)$ at $\tilde\theta_G$. Use that $\nabla L_G(\tilde\theta_G) = 0$ and
$-\nabla^2L_G(\theta) \equiv \sigma^{-2}\Psi\Psi^\top + G^2$.

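Exercise 4.7.8. Prove (4.35).

Hint: show that $\tilde\theta_G - \theta_G = S_G\varepsilon$ with $S_G = (\Psi\Psi^\top + \sigma^2G^2)^{-1}\Psi$.

The straightforward corollaries of Theorem 4.7.3 are the concentration and
confidence probabilities. Define the confidence set $\mathcal{E}_G(\mathfrak{z})$ for $\theta_G$ as

\[ \mathcal{E}_G(\mathfrak{z}) \stackrel{\mathrm{def}}{=} \bigl\{\theta : L_G(\tilde\theta_G, \theta) \le \mathfrak{z}\bigr\}. \]
The definition implies the following result for the coverage probability:
\[ \mathbb{P}\bigl(\mathcal{E}_G(\mathfrak{z}) \not\ni \theta_G\bigr) \le \mathbb{P}\bigl(L_G(\tilde\theta_G, \theta_G) > \mathfrak{z}\bigr). \]

Now the representation (4.35) for $L_G(\tilde\theta_G, \theta_G)$ reduces the problem to a deviation
bound for a quadratic form. We apply the general result of Sect. A.1.


Theorem 4.7.4. Let $L_G(\theta)$ be the penalized log-likelihood from (4.30) and let
$\varepsilon \sim N(0, \sigma^2I_n)$. Then it holds with $\mathtt{p}_G = \operatorname{tr}(\Pi_G)$ and $\mathtt{v}_G^2 = 2\operatorname{tr}(\Pi_G^2)$ that

\[ \mathbb{P}\bigl(2L_G(\tilde\theta_G, \theta_G) > \mathtt{p}_G + (2\mathtt{v}_Gx^{1/2}) \vee (6x)\bigr) \le \exp(-x). \]
Similarly one can state the concentration result. With $D_G^2 = \sigma^{-2}\Psi\Psi^\top + G^2$
\[ 2L_G(\tilde\theta_G, \theta_G) = \|D_G(\tilde\theta_G - \theta_G)\|^2 \]
and the result of Theorem 4.7.4 can be restated as the concentration bound:
\[ \mathbb{P}\bigl(\|D_G(\tilde\theta_G - \theta_G)\|^2 > \mathtt{p}_G + (2\mathtt{v}_Gx^{1/2}) \vee (6x)\bigr) \le \exp(-x). \]

In other words, $\tilde\theta_G$ concentrates on the set $A(\mathfrak{z}, \theta_G) = \{\theta : \|D_G(\theta - \theta_G)\|^2 \le 2\mathfrak{z}\}$ for
$2\mathfrak{z} > \mathtt{p}_G$.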


4.7.4 Projection and Shrinkage Estimates 


Consider a linear model $Y = \Psi^\top\theta^* + \varepsilon$ in which the matrix $\Psi$ is orthonormal
in the sense $\Psi\Psi^\top = I_p$. Then the multiplication with $\Psi$ maps this model into the
sequence space model $Z = \theta^* + \xi$, where $Z = \Psi Y = (z_1, \ldots, z_p)^\top$ is the vector
of empirical Fourier coefficients $z_j = \psi_j^\top Y$. The noise $\xi = \Psi\varepsilon$ borrows the feature
of the original noise $\varepsilon$: if $\varepsilon$ is zero mean and homogeneous, the same applies to $\xi$.
The number of coefficients $p$ can be large or even infinite. To get a sensible estimate,
one has to apply some regularization method. The simplest one is called projection:
one just considers the first $m$ empirical coefficients $z_1, \ldots, z_m$ and drops the others.
The corresponding parameter estimate $\tilde\theta_m$ reads as

\[ \tilde\theta_{m,j} = \begin{cases} z_j & \text{if } j \le m, \\ 0 & \text{otherwise.} \end{cases} \]




The response vector $f^* = \mathbb{E}Y$ is estimated by $\Psi^\top\tilde\theta_m$ leading to the representation

\[ \tilde f_m = z_1\psi_1 + \ldots + z_m\psi_m \]

with $z_j = \psi_j^\top Y$. In other words, $\tilde f_m$ is just a projection of the observed vector
$Y$ onto the subspace $\mathcal{L}_m$ spanned by the first $m$ basis vectors $\psi_1, \ldots, \psi_m$:
$\mathcal{L}_m = \langle\psi_1, \ldots, \psi_m\rangle$. This explains the name of the method. Clearly one can study the
properties of $\tilde\theta_m$ or $\tilde f_m$ using the methods of previous sections. However, one more
question for this approach is still open: a proper choice of $m$. The standard way of
addressing this issue is based on the analysis of the quadratic risk.

Consider first the prediction risk defined as $R(\tilde f_m) = \mathbb{E}\|\tilde f_m - f^*\|^2$. Below we
focus on the case of a homogeneous noise with $\operatorname{Var}(\varepsilon) = \sigma^2I_n$. An extension to
the colored noise is possible. Recall that $\tilde f_m$ effectively estimates the vector
$f_m = \Pi_mf^*$, where $\Pi_m$ is the projector on $\mathcal{L}_m$; see Sect. 4.3.3. Moreover, the quadratic
risk $R(\tilde f_m)$ can be decomposed as

\[ R(\tilde f_m) = \|f^* - \Pi_mf^*\|^2 + \sigma^2m = \sigma^2m + \sum_{j=m+1}^{p}\theta_j^{*2}. \]

Obviously the squared bias $\|f^* - \Pi_mf^*\|^2$ decreases with $m$ while the variance
$\sigma^2m$ linearly grows with $m$. Risk minimization leads to the so-called bias-variance
trade-off: one selects $m$ which minimizes the risk $R(\tilde f_m)$ over all possible $m$:

\[ m^* \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{m} R(\tilde f_m) = \operatorname*{argmin}_{m}\bigl\{\|f^* - \Pi_mf^*\|^2 + \sigma^2m\bigr\}. \]

Unfortunately this choice requires some information about the bias $\|f^* - \Pi_mf^*\|$
which depends on the unknown vector $f^*$. As this information is not available in
a typical situation, the value $m^*$ is also called an oracle choice. A data-driven choice
of $m$ is one of the central issues in nonparametric statistics.
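
To see the trade-off at work, one can evaluate the risk for a known coefficient sequence. The sketch below is purely illustrative: the coefficients $\theta^*_j = j^{-2}$ and the noise level are hypothetical, and the function names are ours. In practice $\theta^*$ is unknown, which is exactly why $m^*$ is called an oracle choice.

```python
def projection_risk(theta_star, sigma, m):
    # R(f_m) = sigma^2 * m + sum_{j > m} theta_j^2 in the sequence space model.
    return sigma**2 * m + sum(t**2 for t in theta_star[m:])

def oracle_m(theta_star, sigma):
    # Oracle choice: minimize the risk over m = 0, ..., p (needs the unknown theta*).
    p = len(theta_star)
    return min(range(p + 1), key=lambda m: projection_risk(theta_star, sigma, m))

# Hypothetical coefficients with polynomial decay theta_j = 1 / j^2.
theta_star = [1.0 / j**2 for j in range(1, 51)]
sigma = 0.05
m_star = oracle_m(theta_star, sigma)
print(m_star, projection_risk(theta_star, sigma, m_star))
```

For a decreasing sequence $|\theta^*_j|$ the oracle keeps exactly those coefficients whose squared value exceeds the noise variance $\sigma^2$.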

The situation is not changed if we consider the estimation risk $\mathbb{E}\|\tilde\theta_m - \theta^*\|^2$.
Indeed, the basis orthogonality $\Psi\Psi^\top = I_p$ implies for $f^* = \Psi^\top\theta^*$

\[ \|\tilde f_m - f^*\|^2 = \|\Psi^\top\tilde\theta_m - \Psi^\top\theta^*\|^2 = \|\tilde\theta_m - \theta^*\|^2 \]

and minimization of the estimation risk coincides with minimization of the
prediction risk.

A disadvantage of the projection method is that it either keeps each empirical
coefficient $z_j$ or completely discards it. An extension of the projection method is
called shrinkage: one multiplies every empirical coefficient $z_j$ with a factor
$\alpha_j \in (0, 1)$. This leads to the shrinkage estimate $\tilde\theta_\alpha$ with

\[ \tilde\theta_{\alpha,j} = \alpha_jz_j. \]
Here $\alpha$ stands for the vector of coefficients $\alpha_j$ for $j = 1, \ldots, p$. A projection
method is a special case of this shrinkage with $\alpha_j$ equal to one or zero. Another
popular choice of the coefficients $\alpha_j$ is given by

\[ \alpha_j = (1 - j/m)^\beta\,\mathbf{1}(j \le m) \tag{4.36} \]

for some $\beta > 0$ and $m \le p$. This choice ensures that the coefficients $\alpha_j$ smoothly
approach zero as $j$ approaches the value $m$, and $\alpha_j$ vanish for $j > m$. In this case,
the vector $\alpha$ is completely specified by two parameters $m$ and $\beta$. The projection
method corresponds to $\beta = 0$. The design orthogonality $\Psi\Psi^\top = I_p$ yields
again that the estimation risk $\mathbb{E}\|\tilde\theta_\alpha - \theta^*\|^2$ coincides with the prediction risk
$\mathbb{E}\|\tilde f_\alpha - f^*\|^2$.

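Exercise 4.7.9. Let $\operatorname{Var}(\varepsilon) = \sigma^2I_n$. The risk $R(\tilde f_\alpha)$ of the shrinkage estimate $\tilde f_\alpha$
fulfills

\[ R(\tilde f_\alpha) \stackrel{\mathrm{def}}{=} \mathbb{E}\|\tilde f_\alpha - f^*\|^2 = \sum_{j=1}^{p}\theta_j^{*2}(1 - \alpha_j)^2 + \sum_{j=1}^{p}\alpha_j^2\sigma^2. \]

Specify the cases of $\alpha = \alpha(m, \beta)$ from (4.36). Evaluate the variance term $\sum_j\alpha_j^2\sigma^2$.

Hint: approximate the sum over $j$ by the integral $\int(1 - x/m)_+^{2\beta}\,dx$.


The oracle choice is again defined by risk minimization:
\[ \alpha^* \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{\alpha} R(\tilde f_\alpha), \]

where minimization is taken over the class of all considered coefficient vectors $\alpha$.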

One way of obtaining a shrinkage estimate in the sequence space model
$Z = \theta^* + \xi$ is by using a roughness penalization. Let $G$ be a symmetric matrix. Consider
the regularized estimate $\tilde\theta_G$ from (4.28). The next result claims that if $G$ is a
diagonal matrix, then $\tilde\theta_G$ is a shrinkage estimate. Moreover, a general penalized
MLE can be represented as shrinkage by an orthogonal basis transformation.


Theorem 4.7.5. Let $G$ be a diagonal matrix, $G = \operatorname{diag}(g_1, \ldots, g_p)$. The penalized
MLE $\tilde\theta_G$ in the sequence space model $Z = \theta^* + \xi$ with $\xi \sim N(0, \sigma^2I_p)$ coincides
with the shrinkage estimate $\tilde\theta_\alpha$ for $\alpha_j = (1 + \sigma^2g_j^2)^{-1} \le 1$. Moreover, a penalized
MLE $\tilde\theta_G$ for a general matrix $G$ can be reduced to a shrinkage estimate by a basis
transformation in the sequence space model.


Proof. The first statement for a diagonal matrix $G$ follows from the representation
$\tilde\theta_G = (I_p + \sigma^2G^2)^{-1}Z$. Next, let $U$ be an orthogonal transform leading to the
diagonal representation $G^2 = U^\top D^2U$ with $D^2 = \operatorname{diag}(g_1, \ldots, g_p)$. Then
\[ U\tilde\theta_G = (I_p + \sigma^2D^2)^{-1}UZ \]

that is, $U\tilde\theta_G$ is a shrinkage estimate in the transformed model $UZ = U\theta^* + U\xi$.
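
Both statements of the theorem can be checked numerically in the sequence space model. The sketch below uses hypothetical inputs (random coefficients `Z`, an arbitrary diagonal penalty, and a random symmetric matrix in the role of $G^2$); the orthogonal transform is obtained from an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(7)
p, sigma = 4, 0.5
Z = rng.normal(size=p)                          # empirical Fourier coefficients

# Diagonal penalty: the penalized MLE shrinks each coordinate separately.
g = np.array([0.0, 1.0, 2.0, 5.0])
alpha = 1.0 / (1.0 + sigma**2 * g**2)           # alpha_j = (1 + sigma^2 g_j^2)^{-1}
theta_diag = np.linalg.solve(np.eye(p) + sigma**2 * np.diag(g**2), Z)

# General symmetric G: diagonalize G^2 = U^T Lam U and shrink in the rotated basis.
A = rng.normal(size=(p, p))
G2 = A @ A.T                                    # a symmetric nonnegative matrix as G^2
lam, V = np.linalg.eigh(G2)                     # G2 = V diag(lam) V^T, so U = V^T
U = V.T
theta_G = np.linalg.solve(np.eye(p) + sigma**2 * G2, Z)
shrunk = (U @ Z) / (1.0 + sigma**2 * lam)       # coordinatewise shrinkage of U Z

print(np.allclose(theta_diag, alpha * Z), np.allclose(U @ theta_G, shrunk))
```

In the rotated basis the shrinkage factors are $(1 + \sigma^2\lambda_j)^{-1}$, where $\lambda_j$ are the eigenvalues of $G^2$.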


In other words, roughness penalization results in some kind of shrinkage. 
Interestingly, the inverse statement holds as well. 


Exercise 4.7.10. Let $\tilde\theta_\alpha$ be a shrinkage estimate for a vector $\alpha = (\alpha_j)$. Then there
is a diagonal penalty matrix $G$ such that $\tilde\theta_\alpha = \tilde\theta_G$.
Hint: define the $j$th diagonal entry $g_j$ by the equation $\alpha_j = (1 + \sigma^2 g_j^2)^{-1}$.


4.7.5 Smoothness Constraints and Roughness Penalty 
Approach 


Another way of reducing the complexity of the estimation procedure is based
on smoothness constraints. The notion of smoothness originates from regression
estimation. A nonlinear regression function $f$ is expanded using a Fourier or
some other functional basis and $\theta^*$ is the corresponding vector of coefficients.
Smoothness properties of the regression function imply a certain rate of decay of
the corresponding Fourier coefficients: the higher the frequency, the less
information about the regression function is contained in the related coefficient.
This leads to the natural idea to replace the original optimization problem over the
whole parameter space with a constrained optimization over a subset of "smooth"
parameter vectors. Here we consider one popular example of Sobolev smoothness
constraints which effectively means that the $s$th derivative of the function $f^*$ has a
bounded $L_2$-norm. A general Sobolev ball can be defined using a diagonal matrix $G$:

$$B_G(R) \stackrel{\text{def}}{=} \{\theta : \|G\theta\| \le R\}.$$


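Now we consider a constrained ML problem:

$$\tilde\theta_{G,R} = \operatorname*{argmax}_{\theta \in B_G(R)} L(\theta) = \operatorname*{argmin}_{\theta \in \Theta :\, \|G\theta\| \le R} \|Y - \Psi^\top\theta\|^2. \tag{4.37}$$

The Lagrange multiplier method leads to an unconstrained problem

$$\tilde\theta_{G,\lambda} = \operatorname*{argmin}_{\theta} \bigl\{ \|Y - \Psi^\top\theta\|^2 + \lambda \|G\theta\|^2 \bigr\}.$$

A proper choice of $\lambda$ ensures that the solution $\tilde\theta_{G,\lambda}$ belongs to $B_G(R)$ and
also solves the problem (4.37). So, the approach based on a Sobolev smoothness
assumption leads back to regularization and shrinkage.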


154 4 Estimation in Linear Models 
4.8 Shrinkage in a Linear Inverse Problem 


This section extends the previous approaches to the situation with indirect
observations. More precisely, we focus on the model

$$Y = A f^* + \varepsilon, \tag{4.38}$$

where $A$ is a given linear operator (matrix) and $f^*$ is the target of analysis. With
the obvious change of notation this problem can be put back in the general linear
setup $Y = \Psi^\top\theta + \varepsilon$. The special focus is due to the facts that the target can be high
dimensional or even functional and that the product $A^\top A$ is usually ill-posed,
so that its inversion is a hard task. Below we consider separately the case when the
spectral representation for this problem is available and the general case.


4.8.1 Spectral Cut-Off and Spectral Penalization: Diagonal 
Estimates 


Suppose that the eigenvectors of the matrix $A^\top A$ are available. This allows for
reducing the model to the spectral representation by an orthogonal change of the
coordinate system: $Z = \Lambda u + \Lambda^{1/2}\xi$ with a diagonal matrix $\Lambda = \operatorname{diag}\{\lambda_1, \ldots, \lambda_p\}$
and a homogeneous noise $\operatorname{Var}(\xi) = \sigma^2 I_p$; see Sect. 4.2.4. Below we assume
without loss of generality that the eigenvalues $\lambda_j$ are ordered and decrease with $j$.
This spectral representation means that one observes empirical Fourier coefficients
$z_j$ described by the equation $z_j = \lambda_j u_j + \lambda_j^{1/2}\xi_j$ for $j = 1, \ldots, p$. The LSE or
qMLE estimate of the spectral parameter $u$ is given by

$$\tilde u = \Lambda^{-1} Z = (\lambda_1^{-1} z_1, \ldots, \lambda_p^{-1} z_p)^\top .$$

Exercise 4.8.1. Consider the spectral representation $Z = \Lambda u + \Lambda^{1/2}\xi$. Show that the LSE $\tilde u$
reads as $\tilde u = \Lambda^{-1} Z$.


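If the dimension $p$ of the model is high or, specifically, if the spectral values
$\lambda_j$ rapidly go to zero, it might be useful to only track a few coefficients $u_1, \ldots, u_m$
and to set all the remaining ones to zero. The corresponding estimate $\tilde u_m =
(\tilde u_{m,1}, \ldots, \tilde u_{m,p})^\top$ reads as

$$\tilde u_{m,j} \stackrel{\text{def}}{=} \begin{cases} \lambda_j^{-1} z_j & \text{if } j \le m, \\ 0 & \text{otherwise.} \end{cases}$$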


It is usually referred to as a spectral cut-off estimate. 


Exercise 4.8.2. Consider the linear model $Y = A f^* + \varepsilon$. Let $U$ be an orthogonal
transform in $\mathbb{R}^p$ providing $U A^\top A U^\top = \Lambda$ with a diagonal matrix $\Lambda$ leading to
the spectral representation for $Z = UAY$. Write the corresponding spectral cut-off
estimate $\tilde f_m$ for the original vector $f^*$. Show that computing this estimate only
requires knowledge of the first $m$ eigenvalues and eigenvectors of the matrix $A^\top A$.


Similarly to the direct case, a spectral cut-off can be extended to spectral
shrinkage: one multiplies every empirical coefficient $z_j$ with a factor $\alpha_j \in (0,1)$.
This leads to the spectral shrinkage estimate $\tilde u_\alpha$ with $\tilde u_{\alpha,j} = \alpha_j \lambda_j^{-1} z_j$. Here $\alpha$ stands
for the vector of coefficients $\alpha_j$ for $j = 1, \ldots, p$. A spectral cut-off method is a
special case of this shrinkage with $\alpha_j$ equal to one or zero.

Exercise 4.8.3. Specify the spectral shrinkage $\tilde u_\alpha$ with a given vector $\alpha$ for the
situation of Exercise 4.8.2.


The spectral cut-off method can be described as follows. Let $\psi_1, \psi_2, \ldots$ be
the intrinsic orthonormal basis of the problem composed of the standardized
eigenvectors of $A^\top A$ and leading to the spectral representation $Z = \Lambda u + \Lambda^{1/2}\xi$
with the target vector $u$. In terms of the original target $f^*$, one is looking for a
solution or an estimate in the form $f = \sum_j u_j \psi_j$. The design orthogonality allows
to estimate every coefficient $u_j$ independently of the others using the empirical
Fourier coefficient $\psi_j^\top Y$. Namely, $\tilde u_j = \lambda_j^{-1} \psi_j^\top Y = \lambda_j^{-1} z_j$. The LSE procedure
tries to recover $f$ as the full sum $\tilde f = \sum_j \tilde u_j \psi_j$. The projection method suggests
to cut this sum at the index $m$: $\tilde f_m = \sum_{j \le m} \tilde u_j \psi_j$, while the shrinkage procedure
is based on downweighting the empirical coefficients $\tilde u_j$: $\tilde f_\alpha = \sum_j \alpha_j \tilde u_j \psi_j$.

Next we study the risk of the shrinkage method. Orthonormality of the basis
$\psi_j$ allows to represent the loss as $\|\tilde u_\alpha - u^*\|^2 = \|\tilde f_\alpha - f^*\|^2$. Under the noise
homogeneity one obtains the following result.


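Theorem 4.8.1. Let $Z = \Lambda u^* + \Lambda^{1/2}\xi$ with $\operatorname{Var}(\xi) = \sigma^2 I_p$. It holds for the
shrinkage estimate $\tilde u_\alpha$

$$R(\tilde u_\alpha) \stackrel{\text{def}}{=} E\|\tilde u_\alpha - u^*\|^2 = \sum_{j=1}^{p} |\alpha_j - 1|^2 u_j^{*2} + \sum_{j=1}^{p} \alpha_j^2 \sigma^2 \lambda_j^{-1}.$$

Proof. The empirical Fourier coefficients $z_j$ are uncorrelated and $E z_j = \lambda_j u_j^*$,
$\operatorname{Var} z_j = \sigma^2 \lambda_j$. This implies

$$E\|\tilde u_\alpha - u^*\|^2 = \sum_{j=1}^{p} E|\alpha_j \lambda_j^{-1} z_j - u_j^*|^2 = \sum_{j=1}^{p} \bigl\{ |\alpha_j - 1|^2 u_j^{*2} + \alpha_j^2 \sigma^2 \lambda_j^{-1} \bigr\}$$

as required.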


Risk minimization leads to the oracle choice of the vector $\alpha$:

$$\alpha^* = \operatorname*{argmin}_{\alpha} R(\tilde u_\alpha),$$

where the minimum is taken over the set of all admissible vectors $\alpha$.
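The risk formula of Theorem 4.8.1 is easy to evaluate for a given configuration. The sketch below does this for cut-off weights (entries one or zero) and for an unrestricted coordinatewise minimizer. The latter is an assumption about the admissible set: if every $\alpha_j$ may be chosen freely, minimizing $(\alpha_j - 1)^2 u_j^{*2} + \alpha_j^2 \sigma^2 \lambda_j^{-1}$ in $\alpha_j$ gives $\alpha_j^* = u_j^{*2} / (u_j^{*2} + \sigma^2 \lambda_j^{-1})$. The eigenvalues and coefficients are hypothetical.

```python
import numpy as np

def shrinkage_risk(alpha, u_star, lam, sigma2):
    """Risk from Theorem 4.8.1: squared bias plus variance of the spectral shrinkage estimate."""
    bias2 = np.sum((alpha - 1.0) ** 2 * u_star ** 2)
    variance = np.sum(alpha ** 2 * sigma2 / lam)
    return bias2 + variance

def cutoff_weights(m, p):
    """Weights of the spectral cut-off: one for j <= m, zero otherwise."""
    return (np.arange(1, p + 1) <= m).astype(float)

p = 20
j = np.arange(1, p + 1, dtype=float)
lam = j ** -2.0       # hypothetical decreasing eigenvalues
u_star = j ** -1.5    # hypothetical true spectral coefficients
sigma2 = 1e-3         # hypothetical noise level

# Risk of every cut-off index m = 0, ..., p and the index with the smallest risk.
cutoff_risks = np.array(
    [shrinkage_risk(cutoff_weights(m, p), u_star, lam, sigma2) for m in range(p + 1)]
)
m_star = int(np.argmin(cutoff_risks))

# Unrestricted coordinatewise minimizer (assumes all real weights are admissible).
alpha_oracle = u_star ** 2 / (u_star ** 2 + sigma2 / lam)
oracle_risk = shrinkage_risk(alpha_oracle, u_star, lam, sigma2)
```

Since cut-off weights are a special case of shrinkage weights, `oracle_risk` cannot exceed the smallest entry of `cutoff_risks`.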
Similar analysis can be done for the spectral cut-off method.

Exercise 4.8.4. The risk of the spectral cut-off estimate $\tilde u_m$ fulfills

$$R(\tilde u_m) = \sum_{j=1}^{m} \lambda_j^{-1} \sigma^2 + \sum_{j=m+1}^{p} u_j^{*2}.$$

Specify the choice of the oracle cut-off index $m^*$.


4.8.2 Galerkin Method


A general problem with the spectral shrinkage approach is that it requires
precise knowledge of the intrinsic basis $\psi_1, \psi_2, \ldots$ or, equivalently, of the eigenvalue
decomposition of $A$ leading to the spectral representation. After this basis is fixed, one
can apply the projection or shrinkage method using the corresponding Fourier
coefficients. In some situations this basis is hardly available or difficult to compute.
A possible way out of this problem is to take some other orthogonal basis $\phi_1, \phi_2, \ldots$
which is tractable and convenient but does not lead to the spectral representation
of the model. The Galerkin method is based on projecting the original high
dimensional problem to a lower dimensional problem in terms of the new basis
$\{\phi_j\}$. Namely, without loss of generality suppose that the target function $f^*$ can be
decomposed as

$$f^* = \sum_{j} u_j^* \phi_j .$$

This can be achieved, e.g., if $f^*$ belongs to some Hilbert space and $\{\phi_j\}$ is
an orthonormal basis in this space. Now we cut this sum and replace this exact
decomposition by a finite approximation

$$f^* \approx f_m = \sum_{j \le m} u_j \phi_j = \Phi_m^\top u_m,$$

where $u_m = (u_1, \ldots, u_m)^\top$ and the matrix $\Phi_m$ is built of the vectors $\phi_1, \ldots, \phi_m$:
$\Phi_m = (\phi_1, \ldots, \phi_m)$. Now we plug this decomposition into the original equation $Y =
A f^* + \varepsilon$. This leads to the linear model $Y = A \Phi_m^\top u_m + \varepsilon = \Psi_m^\top u_m + \varepsilon$ with
$\Psi_m = \Phi_m A^\top$. The corresponding (quasi) MLE reads

$$\tilde u_m = (\Psi_m \Psi_m^\top)^{-1} \Psi_m Y .$$
Note that for computing this estimate one only needs to evaluate the action of the
operator $A$ on the basis functions $\phi_1, \ldots, \phi_m$ and on the data $Y$. With this estimate
$\tilde u_m$ of the vector $u^*$, one obtains the response estimate $\tilde f_m$ of the form

$$\tilde f_m = \Phi_m^\top \tilde u_m = \tilde u_1 \phi_1 + \ldots + \tilde u_m \phi_m .$$


The properties of this estimate can be studied in the same way as for a general
qMLE in a linear model: the true data distribution follows (4.38) while we use the
approximating model $Y = A f_m + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I)$ for building the quasi
likelihood.
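The Galerkin construction can be sketched in a few lines. The code below is a minimal illustration, assuming that $\Phi_m$ is stored as an $m \times p$ array whose rows are the basis vectors, so that $f_m = \Phi_m^\top u_m$ and $\Psi_m = \Phi_m A^\top$ as in the text. The operator, the basis and the coefficients are randomly generated placeholders.

```python
import numpy as np

def galerkin_estimate(A, Phi_m, Y):
    """Galerkin (quasi) MLE: u = (Psi Psi^T)^{-1} Psi Y with Psi = Phi_m A^T, and f = Phi_m^T u."""
    Psi_m = Phi_m @ A.T                                    # m x n design of the reduced model
    u_hat = np.linalg.solve(Psi_m @ Psi_m.T, Psi_m @ Y)    # coefficients in the basis phi_1..phi_m
    f_hat = Phi_m.T @ u_hat                                # estimate of the target vector
    return u_hat, f_hat

rng = np.random.default_rng(1)
n, p, m = 30, 12, 4
A = rng.normal(size=(n, p))                       # hypothetical operator
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))      # some orthonormal basis of R^p
Phi_m = Q[:, :m].T                                # rows are the first m basis vectors

# Noise-free sanity check: a target inside the span of phi_1..phi_m is recovered exactly.
u_true = np.array([2.0, -1.0, 0.5, 0.25])
f_star = Phi_m.T @ u_true
Y_clean = A @ f_star
u_hat, f_hat = galerkin_estimate(A, Phi_m, Y_clean)
```

Only the products of $A$ with the $m$ basis vectors enter the computation, in line with the remark that no eigendecomposition is needed.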

A further extension of the qMLE approach concerns the case when the operator
$A$ is not precisely known. Instead, an approximation or an estimate $\tilde A$ is available.
The pragmatic way of tackling this problem is to use the model $Y = \tilde A f^* + \varepsilon$
for building the quasi likelihood. The use of the Galerkin method is quite natural in
this situation because the spectral representation for $\tilde A$ will not necessarily result in
a similar representation for the true operator $A$.


4.9 Semiparametric Estimation 


This section discusses the situation when the target of estimation does not coincide 
with the parameter vector. This problem is usually referred to as semiparametric 
estimation. One typical example is the problem of estimating a part of the parameter 
vector. More generally one can try to estimate a given function/functional of the 
unknown parameter. We focus here on linear modeling, that is, the considered model 
and the considered mapping of the parameter space to the target space are linear. 
For ease of presentation we assume everywhere homogeneous noise with
$\operatorname{Var}(\varepsilon) = \sigma^2 I_n$.


4.9.1 $(\theta, \eta)$- and $\upsilon$-Setup


This section presents two equivalent descriptions of the semiparametric problem.
The first one assumes that the total parameter vector can be decomposed into the
target parameter $\theta$ and the nuisance parameter $\eta$. The second one operates with the
total parameter $\upsilon$, and the target $\theta$ is a linear mapping of $\upsilon$.

We start with the $(\theta, \eta)$-setup. Let the response $Y$ be modeled in dependence of
two sets of factors: $\{\psi_j, j = 1, \ldots, p\}$ and $\{\phi_m, m = 1, \ldots, p_1\}$. We are mostly
interested in understanding the impact of the first set $\{\psi_j\}$ but we cannot ignore the
influence of the $\{\phi_m\}$'s. Otherwise the model would be incomplete. This situation
can be described by the linear model

$$Y = \Psi^\top \theta^* + \Phi^\top \eta^* + \varepsilon, \tag{4.39}$$

where $\Psi$ is the $p \times n$ matrix with the columns $\psi_j$, while $\Phi$ is the $p_1 \times n$ matrix with
the columns $\phi_m$. We primarily aim at recovering the vector $\theta^*$, while the coefficients
$\eta^*$ are of secondary importance. The corresponding (quasi) log-likelihood reads as

$$L(\theta, \eta) = -(2\sigma^2)^{-1} \|Y - \Psi^\top\theta - \Phi^\top\eta\|^2 + R,$$


where $R$ denotes the remainder term which does not depend on the parameters $\theta, \eta$.
The more general $\upsilon$-setup considers a general linear model

$$Y = \Upsilon^\top \upsilon^* + \varepsilon, \tag{4.40}$$

where $\Upsilon$ is a $p^* \times n$ matrix of $p^*$ factors, and the target of estimation is a linear
mapping $\theta^* = P\upsilon^*$ for a given operator $P$ from $\mathbb{R}^{p^*}$ to $\mathbb{R}^{p}$. Obviously the $(\theta, \eta)$-setup
is a special case of the $\upsilon$-setup. However, a general $\upsilon$-setup can be reduced
back to the $(\theta, \eta)$-setup by a change of variable.

Exercise 4.9.1. Consider the sequence space model $Y = \upsilon^* + \xi$ in $\mathbb{R}^p$ and let
the target of estimation be the sum of the coefficients $\upsilon_1^* + \ldots + \upsilon_p^*$. Describe the
$\upsilon$-setup for the problem. Reduce to the $(\theta, \eta)$-setup by an orthogonal change of the
basis.


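In the $\upsilon$-setup, the (quasi) log-likelihood reads as

$$L(\upsilon) = -(2\sigma^2)^{-1} \|Y - \Upsilon^\top\upsilon\|^2 + R,$$

where $R$ is the remainder which does not depend on $\upsilon$. Hence the log-likelihood
$L(\upsilon)$ is quadratic in $\upsilon$, with the matrix $\mathbb{D}^2$ given by

$$\mathbb{D}^2 = -\nabla^2 L(\upsilon) = \operatorname{Var}\{\nabla L(\upsilon)\} = \sigma^{-2} \Upsilon \Upsilon^\top . \tag{4.41}$$

Exercise 4.9.2. Check the statements in (4.41).

Exercise 4.9.3. Show that for the model (4.39) it holds with $\Upsilon = \begin{pmatrix} \Psi \\ \Phi \end{pmatrix}$

$$\mathbb{D}^2 = \sigma^{-2} \begin{pmatrix} \Psi\Psi^\top & \Psi\Phi^\top \\ \Phi\Psi^\top & \Phi\Phi^\top \end{pmatrix}. \tag{4.42}$$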


4.9.2 Orthogonality and Product Structure 


Consider the model (4.39) under the orthogonality condition $\Psi\Phi^\top = 0$. This
condition effectively means that the factors of interest $\{\psi_j\}$ are orthogonal to the
nuisance factors $\{\phi_m\}$. An important feature of this orthogonal case is that the model
has the product structure leading to the additive form of the log-likelihood. Consider
the partial $\theta$-model $Y = \Psi^\top\theta + \varepsilon$ with the (quasi) log-likelihood

$$L(\theta) = -(2\sigma^2)^{-1} \|Y - \Psi^\top\theta\|^2 + R .$$

Similarly $L_1(\eta) = -(2\sigma^2)^{-1} \|Y - \Phi^\top\eta\|^2 + R_1$ denotes the log-likelihood in the
partial $\eta$-model $Y = \Phi^\top\eta + \varepsilon$.

Theorem 4.9.1. Assume the condition $\Psi\Phi^\top = 0$. Then

$$L(\theta, \eta) = L(\theta) + L_1(\eta) + R(Y), \tag{4.43}$$

where $R(Y)$ is independent of $\theta$ and $\eta$. This implies the block diagonal structure of
the matrix $\mathbb{D}^2 = \sigma^{-2}\Upsilon\Upsilon^\top$:

$$\mathbb{D}^2 = \sigma^{-2} \begin{pmatrix} \Psi\Psi^\top & 0 \\ 0 & \Phi\Phi^\top \end{pmatrix} = \begin{pmatrix} D^2 & 0 \\ 0 & H^2 \end{pmatrix} \tag{4.44}$$

with $D^2 = \sigma^{-2}\Psi\Psi^\top$, $H^2 = \sigma^{-2}\Phi\Phi^\top$.
Proof. The formula (4.44) follows from (4.42) and the orthogonality condition. The
statement (4.43) follows if we show that the difference

$$L(\theta, \eta) - L(\theta) - L_1(\eta)$$

does not depend on $\theta$ and $\eta$. This is a quadratic expression in $\theta, \eta$, so it suffices to
check its first and second derivatives w.r.t. the parameters. For the first derivative,
it holds by the orthogonality condition

$$\nabla_\theta L(\theta, \eta) \stackrel{\text{def}}{=} \partial L(\theta, \eta)/\partial\theta = \sigma^{-2}\Psi(Y - \Psi^\top\theta - \Phi^\top\eta) = \sigma^{-2}\Psi(Y - \Psi^\top\theta),$$

which coincides with $\nabla L(\theta)$. Similarly $\nabla_\eta L(\theta, \eta) = \nabla L_1(\eta)$, yielding

$$\nabla\{L(\theta, \eta) - L(\theta) - L_1(\eta)\} = 0 .$$

The identities (4.41) and (4.42) imply that $\nabla^2\{L(\theta, \eta) - L(\theta) - L_1(\eta)\} = 0$. This
implies the desired assertion.


Exercise 4.9.4. Check the statement (4.43) by direct computations. Describe the 
term R(Y). 


Now we demonstrate how the general case can be reduced to the orthogonal one
by a linear transformation of the nuisance parameter. Let $C$ be a $p \times p_1$ matrix.
Define $\breve\eta = \eta + C^\top\theta$. Then the model equation $Y = \Psi^\top\theta + \Phi^\top\eta + \varepsilon$ can be
rewritten as

$$Y = \Psi^\top\theta + \Phi^\top(\breve\eta - C^\top\theta) + \varepsilon = (\Psi - C\Phi)^\top\theta + \Phi^\top\breve\eta + \varepsilon .$$

Now we select $C$ to ensure the orthogonality. This leads to the equation

$$(\Psi - C\Phi)\Phi^\top = 0$$

or $C = \Psi\Phi^\top(\Phi\Phi^\top)^{-1}$. So, the original model can be rewritten as

$$Y = \breve\Psi^\top\theta + \Phi^\top\breve\eta + \varepsilon, \qquad \breve\Psi = \Psi - C\Phi = \Psi(I_n - \Pi_\eta), \tag{4.45}$$

where $\Pi_\eta = \Phi^\top(\Phi\Phi^\top)^{-1}\Phi$ is the projector on the linear subspace spanned by
the nuisance factors $\{\phi_m\}$. This construction has a natural interpretation: correcting
the $\theta$-factors $\psi_1, \ldots, \psi_p$ by removing their interaction with the nuisance factors
$\phi_1, \ldots, \phi_{p_1}$ reduces the general case to the orthogonal one. We summarize:


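Theorem 4.9.2. The linear model (4.39) can be represented in the orthogonal form

$$Y = \breve\Psi^\top\theta + \Phi^\top\breve\eta + \varepsilon,$$

where $\breve\Psi$ from (4.45) satisfies $\breve\Psi\Phi^\top = 0$ and $\breve\eta = \eta + C^\top\theta$ for $C =
\Psi\Phi^\top(\Phi\Phi^\top)^{-1}$. Moreover, it holds for $\upsilon = (\theta, \eta)$

$$L(\upsilon) = \breve L(\theta) + L_1(\breve\eta) + R(Y) \tag{4.46}$$

with

$$\breve L(\theta) = -(2\sigma^2)^{-1}\|Y - \breve\Psi^\top\theta\|^2 + R,$$
$$L_1(\eta) = -(2\sigma^2)^{-1}\|Y - \Phi^\top\eta\|^2 + R_1 .$$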
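The reduction to the orthogonal form is purely algebraic and can be verified numerically. The following sketch builds $C$, $\Pi_\eta$ and $\breve\Psi$ for randomly generated design matrices (hypothetical data; `projector` and `orthogonalize` are illustrative helper names) and checks that the reparametrized model has the same mean as the original one.

```python
import numpy as np

def projector(M):
    """Orthogonal projector M^T (M M^T)^{-1} M onto the row space of M (full row rank assumed)."""
    return M.T @ np.linalg.solve(M @ M.T, M)

def orthogonalize(Psi, Phi):
    """Return the corrected factors Psi(I - Pi_eta) and the matrix C = Psi Phi^T (Phi Phi^T)^{-1}."""
    n = Psi.shape[1]
    C = Psi @ Phi.T @ np.linalg.inv(Phi @ Phi.T)
    Psi_breve = Psi @ (np.eye(n) - projector(Phi))
    return Psi_breve, C

rng = np.random.default_rng(2)
n, p, p1 = 40, 3, 2
Psi = rng.normal(size=(p, n))     # hypothetical target factors
Phi = rng.normal(size=(p1, n))    # hypothetical nuisance factors
theta = rng.normal(size=p)
eta = rng.normal(size=p1)

Psi_breve, C = orthogonalize(Psi, Phi)
eta_breve = eta + C.T @ theta     # transformed nuisance parameter

mean_original = Psi.T @ theta + Phi.T @ eta
mean_orthogonal = Psi_breve.T @ theta + Phi.T @ eta_breve
```

Here `Psi_breve @ Phi.T` vanishes up to rounding, and the two mean vectors coincide.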


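Exercise 4.9.5. Show that for $C = \Psi\Phi^\top(\Phi\Phi^\top)^{-1}$

$$\nabla\breve L(\theta) = \nabla_\theta L(\upsilon) - C\nabla_\eta L(\upsilon) .$$

Exercise 4.9.6. Show that the remainder term $R(Y)$ in Eq. (4.46) is the same
as in the orthogonal case (4.43).

Exercise 4.9.7. Show that $\breve\Psi\breve\Psi^\top < \Psi\Psi^\top$ if $\Psi\Phi^\top \ne 0$.
Hint: for any vector $\gamma \in \mathbb{R}^p$, it holds with $h = \Psi^\top\gamma$

$$\gamma^\top\breve\Psi\breve\Psi^\top\gamma = \|(\Psi - \Psi\Pi_\eta)^\top\gamma\|^2 = \|h - \Pi_\eta h\|^2 \le \|h\|^2 .$$

Moreover, the equality here for any $\gamma$ is only possible if

$$\Pi_\eta h = \Phi^\top(\Phi\Phi^\top)^{-1}\Phi\Psi^\top\gamma = 0,$$

that is, if $\Phi\Psi^\top = 0$.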
4.9.3 Partial Estimation


This section explains the important notion of partial estimation which is quite
natural and transparent in the $(\theta, \eta)$-setup. Let some value $\eta^\circ$ of the nuisance
parameter be fixed. A particular case of this sort is just ignoring the factors $\{\phi_m\}$
corresponding to the nuisance component, that is, one uses $\eta^\circ = 0$. This approach
is reasonable in certain situations, e.g. in the context of the projection method or spectral
cut-off.

Define the estimate $\tilde\theta(\eta^\circ)$ by partial optimization of the joint log-likelihood
$L(\theta, \eta^\circ)$ w.r.t. the first parameter $\theta$:

$$\tilde\theta(\eta^\circ) = \operatorname*{argmax}_{\theta} L(\theta, \eta^\circ) .$$

Obviously $\tilde\theta(\eta^\circ)$ is the MLE in the residual model $Y - \Phi^\top\eta^\circ = \Psi^\top\theta^* + \varepsilon$:

$$\tilde\theta(\eta^\circ) = (\Psi\Psi^\top)^{-1}\Psi(Y - \Phi^\top\eta^\circ) .$$

This allows for describing the properties of the partial estimate $\tilde\theta(\eta^\circ)$ similarly to
the usual parametric situation.


Theorem 4.9.3. Consider the model (4.39). Then the partial estimate $\tilde\theta(\eta^\circ)$ fulfills

$$E\tilde\theta(\eta^\circ) = \theta^* + (\Psi\Psi^\top)^{-1}\Psi\Phi^\top(\eta^* - \eta^\circ), \qquad \operatorname{Var}\{\tilde\theta(\eta^\circ)\} = \sigma^2(\Psi\Psi^\top)^{-1} .$$

In other words, $\tilde\theta(\eta^\circ)$ has the same variance as the MLE in the partial model $Y =
\Psi^\top\theta^* + \varepsilon$ but it is biased if $\Psi\Phi^\top(\eta^* - \eta^\circ) \ne 0$. The ideal situation corresponds to
the case when $\eta^\circ = \eta^*$. Then $\tilde\theta(\eta^*)$ is the MLE in the correctly specified $\theta$-model:
with $Y(\eta^*) \stackrel{\text{def}}{=} Y - \Phi^\top\eta^*$,

$$Y(\eta^*) = \Psi^\top\theta^* + \varepsilon .$$

An interesting and natural question is the justification of the partial estimation
method: under which conditions it is legitimate and does not produce any estimation
bias. The answer is given by Theorem 4.9.1: the orthogonality condition $\Psi\Phi^\top = 0$
ensures the desired feature because of the decomposition (4.43).


Theorem 4.9.4. Assume orthogonality $\Psi\Phi^\top = 0$. Then the partial estimate $\tilde\theta(\eta^\circ)$
does not depend on the nuisance parameter $\eta^\circ$ used:

$$\tilde\theta = \tilde\theta(\eta^\circ) = \tilde\theta(\eta^*) = (\Psi\Psi^\top)^{-1}\Psi Y .$$

In particular, one can ignore the nuisance parameter and estimate $\theta^*$ from the
partial incomplete model $Y = \Psi^\top\theta^* + \varepsilon$.

Exercise 4.9.8. Check that the partial derivative $\partial L(\theta, \eta)/\partial\theta$ does not depend on $\eta$
under the orthogonality condition.


The partial estimation can also be considered in the context of estimating the nuisance
parameter $\eta$ by inverting the roles of $\theta$ and $\eta$. Namely, given a fixed value $\theta^\circ$, one
can optimize the joint log-likelihood $L(\theta, \eta)$ w.r.t. the second argument $\eta$ leading
to the estimate

$$\tilde\eta(\theta^\circ) \stackrel{\text{def}}{=} \operatorname*{argmax}_{\eta} L(\theta^\circ, \eta) .$$

In the orthogonal situation the initial point $\theta^\circ$ is not important and one can use the
partial incomplete model $Y = \Phi^\top\eta^* + \varepsilon$.


4.9.4 Profile Estimation 


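This section discusses one general profile likelihood method of estimating the target
parameter $\theta$ in the semiparametric situation. Later we show its optimality and
R-efficiency. The method first estimates the entire parameter vector $\upsilon$
by using the (quasi) ML method. Then the operator $P$ is applied to the obtained
estimate $\tilde\upsilon$ to produce the estimate $\tilde\theta$. One can describe this method as

$$\tilde\upsilon = \operatorname*{argmax}_{\upsilon} L(\upsilon), \qquad \tilde\theta = P\tilde\upsilon . \tag{4.47}$$

The first step here is the usual LS estimation of $\upsilon^*$ in the linear model (4.40):

$$\tilde\upsilon = \operatorname*{arginf}_{\upsilon} \|Y - \Upsilon^\top\upsilon\|^2 = (\Upsilon\Upsilon^\top)^{-1}\Upsilon Y .$$

The estimate $\tilde\theta$ is obtained by applying $P$ to $\tilde\upsilon$:

$$\tilde\theta = P\tilde\upsilon = P(\Upsilon\Upsilon^\top)^{-1}\Upsilon Y = SY \tag{4.48}$$

with $S = P(\Upsilon\Upsilon^\top)^{-1}\Upsilon$. The properties of this estimate can be studied using the
decomposition $Y = f^* + \varepsilon$ with $f^* = EY$; cf. Sect. 4.4. In particular, it holds

$$E\tilde\theta = S f^*, \qquad \operatorname{Var}(\tilde\theta) = S \operatorname{Var}(\varepsilon) S^\top . \tag{4.49}$$

If the noise $\varepsilon$ is homogeneous with $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$, then

$$\operatorname{Var}(\tilde\theta) = \sigma^2 S S^\top = \sigma^2 P(\Upsilon\Upsilon^\top)^{-1}P^\top . \tag{4.50}$$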

The next theorem summarizes our findings. 
Theorem 4.9.5. Consider the model (4.40) with homogeneous error $\operatorname{Var}(\varepsilon) =
\sigma^2 I_n$. The profile MLE $\tilde\theta$ follows (4.48). Its mean and variance are given by (4.49)
and (4.50).

The profile MLE is usually written in the $(\theta, \eta)$-setup. Let $\upsilon = (\theta, \eta)$. Then
the target $\theta$ is obtained by projecting the MLE $(\tilde\theta, \tilde\eta)$ on the $\theta$-coordinates. This
procedure can be formalized as

$$\tilde\theta = \operatorname*{argmax}_{\theta} \max_{\eta} L(\theta, \eta) .$$


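Another way of describing the profile MLE is based on the partial optimization
considered in the previous section. Define for each $\theta$ the value $\breve L(\theta)$ by optimizing
the log-likelihood $L(\upsilon)$ under the condition $P\upsilon = \theta$:

$$\breve L(\theta) \stackrel{\text{def}}{=} \sup_{\upsilon :\, P\upsilon = \theta} L(\upsilon) = \sup_{\eta} L(\theta, \eta) . \tag{4.51}$$

Then $\tilde\theta$ is defined by maximizing the partial fit $\breve L(\theta)$:

$$\tilde\theta \stackrel{\text{def}}{=} \operatorname*{argmax}_{\theta} \breve L(\theta) . \tag{4.52}$$

Exercise 4.9.9. Check that (4.47) and (4.52) lead to the same estimate $\tilde\theta$.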


We use for the function $\breve L(\theta)$ obtained by partial optimization (4.51) the same
notation as for the function obtained by the orthogonal decomposition (4.43) in
Sect. 4.9.2. Later we show that these two functions indeed coincide. This helps in
understanding the structure of the profile estimate $\tilde\theta$.

Consider first the orthogonal case $\Psi\Phi^\top = 0$. This assumption considerably
simplifies the study. In particular, the result of Theorem 4.9.4 for partial estimation
can be obviously extended to the profile method in view of the product structure (4.43):
when estimating the parameter $\theta$, one can ignore the nuisance parameter $\eta$ and
proceed as if the partial model $Y = \Psi^\top\theta^* + \varepsilon$ were correct. Theorem 4.9.1 implies:


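Theorem 4.9.6. Assume that $\Psi\Phi^\top = 0$ in the model (4.39). Then the profile MLE
$\tilde\theta$ from (4.52) coincides with the MLE from the partial model $Y = \Psi^\top\theta^* + \varepsilon$:

$$\tilde\theta = \operatorname*{argmax}_{\theta} L(\theta) = \operatorname*{argmin}_{\theta} \|Y - \Psi^\top\theta\|^2 = (\Psi\Psi^\top)^{-1}\Psi Y .$$

It holds $E\tilde\theta = \theta^*$ and

$$\tilde\theta - \theta^* = D^{-2}\zeta = D^{-1}\xi$$

with $D^2 = \sigma^{-2}\Psi\Psi^\top$, $\zeta = \sigma^{-2}\Psi\varepsilon$, and $\xi = D^{-1}\zeta$. Finally, $\breve L(\theta)$ from (4.51)
fulfills

$$2\{\breve L(\tilde\theta) - \breve L(\theta^*)\} = \|D(\tilde\theta - \theta^*)\|^2 = \zeta^\top D^{-2}\zeta = \|\xi\|^2 .$$

Moreover, if $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$, then $\operatorname{Var}(\xi) = I_p$. If $\varepsilon \sim N(0, \sigma^2 I_n)$, then $\xi$ is
standard normal in $\mathbb{R}^p$.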


The general case can be reduced to the orthogonal one by the construction from
Theorem 4.9.2. Let

$$\breve\Psi = \Psi - \Psi\Pi_\eta = \Psi - \Psi\Phi^\top(\Phi\Phi^\top)^{-1}\Phi$$

be the corrected $\Psi$-factors after removing their interactions with the $\Phi$-factors.


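Theorem 4.9.7. Consider the model (4.39), and let the matrix $\breve D^2 = \sigma^{-2}\breve\Psi\breve\Psi^\top$ be
non-degenerate. Then the profile MLE $\tilde\theta$ reads as

$$\tilde\theta = \operatorname*{argmin}_{\theta} \|Y - \breve\Psi^\top\theta\|^2 = (\breve\Psi\breve\Psi^\top)^{-1}\breve\Psi Y . \tag{4.53}$$

It holds $E\tilde\theta = \theta^*$ and

$$\tilde\theta - \theta^* = \breve D^{-2}\breve\zeta = \breve D^{-1}\breve\xi \tag{4.54}$$

with $\breve D^2 = \sigma^{-2}\breve\Psi\breve\Psi^\top$, $\breve\zeta = \sigma^{-2}\breve\Psi\varepsilon$, and $\breve\xi = \breve D^{-1}\breve\zeta$. Furthermore, $\breve L(\theta)$ from (4.51)
fulfills

$$2\{\breve L(\tilde\theta) - \breve L(\theta^*)\} = \|\breve D(\tilde\theta - \theta^*)\|^2 = \breve\zeta^\top \breve D^{-2}\breve\zeta = \|\breve\xi\|^2 . \tag{4.55}$$

Moreover, if $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$, then $\operatorname{Var}(\breve\xi) = I_p$. If $\varepsilon \sim N(0, \sigma^2 I_n)$, then $\breve\xi \sim
N(0, I_p)$.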


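Finally we present the same result in terms of the original log-likelihood $L(\upsilon)$.

Theorem 4.9.8. Write $\mathbb{D}^2 = -\nabla^2 E L(\upsilon)$ for the model (4.39) in the block form

$$\mathbb{D}^2 = \begin{pmatrix} D^2 & A \\ A^\top & H^2 \end{pmatrix}. \tag{4.56}$$

Let $D^2$ and $H^2$ be invertible. Then $\breve D^2$ and $\breve\zeta$ in (4.54) can be represented as

$$\breve D^2 = D^2 - A H^{-2} A^\top,$$
$$\breve\zeta = \nabla_\theta L(\upsilon^*) - A H^{-2}\nabla_\eta L(\upsilon^*) .$$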


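Proof. In view of Theorem 4.9.7, it suffices to check the formulas for $\breve D^2$ and $\breve\zeta$.
One has for $\breve\Psi = \Psi(I_n - \Pi_\eta)$ and $A = \sigma^{-2}\Psi\Phi^\top$

$$\breve D^2 = \sigma^{-2}\breve\Psi\breve\Psi^\top = \sigma^{-2}\Psi(I_n - \Pi_\eta)\Psi^\top = \sigma^{-2}\Psi\Psi^\top - \sigma^{-2}\Psi\Phi^\top(\Phi\Phi^\top)^{-1}\Phi\Psi^\top = D^2 - A H^{-2} A^\top .$$

Similarly, by $A H^{-2} = \Psi\Phi^\top(\Phi\Phi^\top)^{-1}$, $\nabla_\theta L(\upsilon^*) = \sigma^{-2}\Psi\varepsilon$, and $\nabla_\eta L(\upsilon^*) = \sigma^{-2}\Phi\varepsilon$

$$\breve\zeta = \sigma^{-2}\breve\Psi\varepsilon = \sigma^{-2}\Psi\varepsilon - \sigma^{-2}\Psi\Phi^\top(\Phi\Phi^\top)^{-1}\Phi\varepsilon = \nabla_\theta L(\upsilon^*) - A H^{-2}\nabla_\eta L(\upsilon^*)$$

as required.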
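The equivalent descriptions of the profile MLE can be compared numerically. The sketch below, on hypothetical random data, computes the estimate once as the $\theta$-block of the full least squares solution as in (4.47) and once from the corrected factors as in (4.53), and it checks the block formula $\breve D^2 = D^2 - A H^{-2} A^\top$ together with the variance identity $\sigma^2 P(\Upsilon\Upsilon^\top)^{-1}P^\top = \breve D^{-2}$ for the coordinate projector $P$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, p1, sigma2 = 50, 3, 4, 0.7
Psi = rng.normal(size=(p, n))     # hypothetical target factors
Phi = rng.normal(size=(p1, n))    # hypothetical nuisance factors
Y = (Psi.T @ rng.normal(size=p) + Phi.T @ rng.normal(size=p1)
     + np.sqrt(sigma2) * rng.normal(size=n))

# Profile MLE as the theta-block of the full LSE, cf. (4.47).
Ups = np.vstack([Psi, Phi])
ups_tilde = np.linalg.solve(Ups @ Ups.T, Ups @ Y)
theta_profile = ups_tilde[:p]

# Profile MLE from the corrected factors, cf. (4.53).
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
Psi_breve = Psi @ (np.eye(n) - Pi_eta)
theta_breve = np.linalg.solve(Psi_breve @ Psi_breve.T, Psi_breve @ Y)

# Block representation of Theorem 4.9.8.
D2 = Psi @ Psi.T / sigma2
A = Psi @ Phi.T / sigma2
H2 = Phi @ Phi.T / sigma2
D2_breve = Psi_breve @ Psi_breve.T / sigma2
schur = D2 - A @ np.linalg.solve(H2, A.T)

# Variance of the profile MLE under homogeneous noise, cf. (4.50).
var_profile = sigma2 * np.linalg.inv(Ups @ Ups.T)[:p, :p]
```

All three comparisons hold up to rounding, independently of the realized noise, which reflects the purely geometric nature of the result.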


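It is worth stressing again that the result of Theorems 4.9.6 through 4.9.8 is
purely geometrical. We only used the condition $E\varepsilon = 0$ in the model (4.39) and
the quadratic structure of the log-likelihood function $L(\upsilon)$. The distribution of the
vector $\varepsilon$ does not enter in the results and proofs. However, the representation (4.54)
allows for a straightforward analysis of the probabilistic properties of the estimate $\tilde\theta$.

Theorem 4.9.9. Consider the model (4.39) and let $\operatorname{Var}(Y) = \operatorname{Var}(\varepsilon) = \Sigma_0$. Then

$$\operatorname{Var}(\tilde\theta) = \sigma^{-4}\breve D^{-2}\breve\Psi\Sigma_0\breve\Psi^\top\breve D^{-2}, \qquad \operatorname{Var}(\breve\xi) = \sigma^{-4}\breve D^{-1}\breve\Psi\Sigma_0\breve\Psi^\top\breve D^{-1} .$$

In particular, if $\operatorname{Var}(Y) = \sigma^2 I_n$, this implies that

$$\operatorname{Var}(\tilde\theta) = \breve D^{-2}, \qquad \operatorname{Var}(\breve\xi) = I_p .$$

Exercise 4.9.10. Check the result of Theorem 4.9.9. Specify this result to the
orthogonal case $\Psi\Phi^\top = 0$.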


4.9.5 Semiparametric Efficiency Bound 


The main goal of this section is to show that the profile method in semiparametric
estimation leads to R-efficient procedures. Recall that the target of estimation
is $\theta^* = P\upsilon^*$ for a given linear mapping $P$. The profile MLE $\tilde\theta$ is one natural
candidate. The next result claims its optimality.

Theorem 4.9.10 (Gauss-Markov). Let $Y$ follow $Y = \Upsilon^\top\upsilon^* + \varepsilon$ for homogeneous
errors $\varepsilon$. Then the estimate $\tilde\theta$ of $\theta^* = P\upsilon^*$ from (4.48) is unbiased and

$$\operatorname{Var}(\tilde\theta) = \sigma^2 P(\Upsilon\Upsilon^\top)^{-1}P^\top,$$

yielding

$$E\|\tilde\theta - \theta^*\|^2 = \sigma^2 \operatorname{tr}\{P(\Upsilon\Upsilon^\top)^{-1}P^\top\} .$$

Moreover, this risk is minimal in the class of all unbiased linear estimates of $\theta^*$.
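Proof. The statements about the properties of $\tilde\theta$ have been already proved. The
lower bound can be proved by the same arguments as in the case of the MLE
estimation in Sect. 4.4.3. We only outline the main steps. Let $\hat\theta$ be any unbiased
linear estimate of $\theta^*$. The idea is to show that the difference $\hat\theta - \tilde\theta$ is orthogonal to
$\tilde\theta$ in the sense $E(\hat\theta - \tilde\theta)\tilde\theta^\top = 0$. This implies that the variance of $\hat\theta$ is the sum of
$\operatorname{Var}(\tilde\theta)$ and $\operatorname{Var}(\hat\theta - \tilde\theta)$ and therefore not smaller than $\operatorname{Var}(\tilde\theta)$.

Let $\hat\theta = BY$ for some matrix $B$. Then $E\hat\theta = B\,EY = B\Upsilon^\top\upsilon^*$. The no-bias
property yields the identity $E\hat\theta = \theta^* = P\upsilon^*$ and thus

$$B\Upsilon^\top - P = 0 . \tag{4.57}$$

Next, $E\tilde\theta = E\hat\theta = \theta^*$ and thus

$$E\tilde\theta\tilde\theta^\top = \theta^*\theta^{*\top} + \operatorname{Var}(\tilde\theta),$$
$$E\hat\theta\tilde\theta^\top = \theta^*\theta^{*\top} + E(\hat\theta - E\hat\theta)(\tilde\theta - E\tilde\theta)^\top .$$

Obviously $\hat\theta - E\hat\theta = B\varepsilon$ and $\tilde\theta - E\tilde\theta = S\varepsilon$, yielding $\operatorname{Var}(\tilde\theta) = \sigma^2 SS^\top$ and
$E B\varepsilon(S\varepsilon)^\top = \sigma^2 BS^\top$. So

$$E(\hat\theta - \tilde\theta)\tilde\theta^\top = \sigma^2(B - S)S^\top .$$

The identity (4.57) implies

$$(B - S)S^\top = \{B - P(\Upsilon\Upsilon^\top)^{-1}\Upsilon\}\Upsilon^\top(\Upsilon\Upsilon^\top)^{-1}P^\top = (B\Upsilon^\top - P)(\Upsilon\Upsilon^\top)^{-1}P^\top = 0$$

and the result follows.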


Now we specify the efficiency bound for the $(\theta, \eta)$-setup (4.39). In this case $P$
is just the projector onto the $\theta$-coordinates.


4.9.6 Inference for the Profile Likelihood Approach 


This section discusses the construction of confidence and concentration sets for the
profile ML estimation. The key fact behind this construction is the chi-squared
result which extends without any change from the parametric to the semiparametric
framework.

The definition of $\tilde\theta$ from (4.52) suggests to define a confidence set (CS) for $\theta^*$ as the level set of
$\breve L(\theta) = \sup_{\upsilon :\, P\upsilon = \theta} L(\upsilon)$:

$$\mathcal{E}(\mathfrak{z}) \stackrel{\text{def}}{=} \{\theta : \breve L(\tilde\theta) - \breve L(\theta) \le \mathfrak{z}\} .$$

This definition can be rewritten as

$$\mathcal{E}(\mathfrak{z}) \stackrel{\text{def}}{=} \Bigl\{\theta : \sup_{\upsilon} L(\upsilon) - \sup_{\upsilon :\, P\upsilon = \theta} L(\upsilon) \le \mathfrak{z}\Bigr\} .$$

It is obvious that the unconstrained optimization of the log-likelihood $L(\upsilon)$ w.r.t. $\upsilon$
yields a value not smaller than the optimization under the constraint $P\upsilon = \theta$. The point
$\theta$ belongs to $\mathcal{E}(\mathfrak{z})$ if the difference between these two values does not exceed $\mathfrak{z}$.
As usual, the main question is the choice of a value $\mathfrak{z}$ which ensures the prescribed
coverage probability of $\theta^*$. This naturally leads to studying the deviation probability

$$P\Bigl(\sup_{\upsilon} L(\upsilon) - \sup_{\upsilon :\, P\upsilon = \theta^*} L(\upsilon) > \mathfrak{z}\Bigr) .$$


Such a study is especially simple in the orthogonal case. The answer is as
expected: the expression and the value are exactly the same as in the case without
any nuisance parameter $\eta$, which simply has no impact. In particular, the chi-squared
result still holds.

In this section we follow the line and the notation of Sect. 4.9.4. In particular, we
use the block notation (4.56) for the matrix $\mathbb{D}^2 = -\nabla^2 L(\upsilon)$.


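Theorem 4.9.11. Consider the model (4.39). Let the matrix $\breve D^2$ be non-degenerate.
If $\varepsilon \sim N(0, \sigma^2 I_n)$, then

$$2\{\breve L(\tilde\theta) - \breve L(\theta^*)\} \sim \chi_p^2, \tag{4.58}$$

that is, $2\{\breve L(\tilde\theta) - \breve L(\theta^*)\}$ is chi-squared with $p$ degrees of freedom.

Proof. The result is based on the representation (4.55) $2\{\breve L(\tilde\theta) - \breve L(\theta^*)\} = \|\breve\xi\|^2$ from
Theorem 4.9.7. It remains to note that normality of $\varepsilon$ implies normality of $\breve\xi$ and the
moment conditions $E\breve\xi = 0$, $\operatorname{Var}(\breve\xi) = I_p$ imply (4.58).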


This result means that the chi-squared result continues to hold in the general
semiparametric framework as well. One possible explanation is as follows: it applies
in the orthogonal case, and the general situation can be reduced to the orthogonal
case by a change of coordinates which preserves the value of the maximum
likelihood.

The statement (4.58) of Theorem 4.9.11 has an interesting geometric interpretation
which is often used in analysis of variance. Consider the expansion

$$\breve L(\tilde\theta) - \breve L(\theta^*) = \breve L(\tilde\theta) - L(\theta^*, \eta^*) - \{\breve L(\theta^*) - L(\theta^*, \eta^*)\} .$$

The quantity $L_\upsilon \stackrel{\text{def}}{=} \breve L(\tilde\theta) - L(\upsilon^*)$ coincides with $L(\tilde\upsilon, \upsilon^*) = L(\tilde\upsilon) - L(\upsilon^*)$; see (4.47). Thus,
$2L_\upsilon$ is chi-squared with $p^*$ degrees of freedom by the chi-squared result. Moreover,
$2\sigma^2 L(\tilde\upsilon, \upsilon^*) = \|\Pi_\upsilon\varepsilon\|^2$, where $\Pi_\upsilon = \Upsilon^\top(\Upsilon\Upsilon^\top)^{-1}\Upsilon$ is the projector on the
linear subspace spanned by the joint collection of factors $\{\psi_j\}$ and $\{\phi_m\}$. Similarly,
the quantity $L_\eta \stackrel{\text{def}}{=} \breve L(\theta^*) - L(\theta^*, \eta^*) = \sup_\eta L(\theta^*, \eta) - L(\theta^*, \eta^*)$ is the maximum
likelihood in the partial $\eta$-model. Therefore, $2L_\eta$ is also chi-squared distributed with
$p_1$ degrees of freedom, and $2\sigma^2 L_\eta = \|\Pi_\eta\varepsilon\|^2$, where $\Pi_\eta = \Phi^\top(\Phi\Phi^\top)^{-1}\Phi$ is the
projector on the linear subspace spanned by the $\eta$-factors $\{\phi_m\}$. Now we use the
decomposition $\Pi_\upsilon = \Pi_\eta + (\Pi_\upsilon - \Pi_\eta)$, in which $\Pi_\upsilon - \Pi_\eta$ is also a projector on a
subspace of dimension $p$. This explains the result (4.58) that the difference of these
two quantities is chi-squared with $p = p^* - p_1$ degrees of freedom. The above
consideration leads to the following result.


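Theorem 4.9.12. It holds for the model (4.39) with $\breve\Pi_\theta = \Pi_\upsilon - \Pi_\eta$

$$2\breve L(\tilde\theta) - 2\breve L(\theta^*) = \sigma^{-2}\bigl(\|\Pi_\upsilon\varepsilon\|^2 - \|\Pi_\eta\varepsilon\|^2\bigr) = \sigma^{-2}\|\breve\Pi_\theta\varepsilon\|^2 = \sigma^{-2}\varepsilon^\top\breve\Pi_\theta\varepsilon . \tag{4.59}$$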
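The identity (4.59) can be illustrated numerically for one realization of the noise. This is a sanity check on hypothetical random data, not a proof. The sketch assumes that the profile log-likelihood is evaluated, up to an additive constant, as $-(2\sigma^2)^{-1}\|(I_n - \Pi_\eta)(Y - \Psi^\top\theta)\|^2$, which is what maximizing $L(\theta, \eta)$ over $\eta$ gives in the linear Gaussian model; the helper names are illustrative.

```python
import numpy as np

def projector(M):
    """Orthogonal projector onto the row space of M (full row rank assumed)."""
    return M.T @ np.linalg.solve(M @ M.T, M)

def profile_loglik(theta, Y, Psi, Phi, sigma2):
    """sup over eta of L(theta, eta), up to an additive constant not depending on theta."""
    n = len(Y)
    resid = (np.eye(n) - projector(Phi)) @ (Y - Psi.T @ theta)
    return -np.sum(resid ** 2) / (2.0 * sigma2)

rng = np.random.default_rng(4)
n, p, p1, sigma2 = 30, 2, 3, 1.5
Psi = rng.normal(size=(p, n))
Phi = rng.normal(size=(p1, n))
theta_star = rng.normal(size=p)
eta_star = rng.normal(size=p1)
eps = np.sqrt(sigma2) * rng.normal(size=n)
Y = Psi.T @ theta_star + Phi.T @ eta_star + eps

# Profile MLE as the theta-block of the full LSE.
Ups = np.vstack([Psi, Phi])
theta_tilde = np.linalg.solve(Ups @ Ups.T, Ups @ Y)[:p]

lhs = 2.0 * (profile_loglik(theta_tilde, Y, Psi, Phi, sigma2)
             - profile_loglik(theta_star, Y, Psi, Phi, sigma2))

Pi_diff = projector(Ups) - projector(Phi)   # the projector Pi_upsilon - Pi_eta
rhs = eps @ Pi_diff @ eps / sigma2
```

The difference `Pi_diff` is idempotent with trace $p$, which is the source of the $p$ degrees of freedom in (4.58).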


Exercise 4.9.11. Check the formula (4.59). Show that it implies (4.58). 


4.9.7 Plug-In Method 


Although the profile MLE can be represented in a closed form, its computation can
be a hard task if the dimensionality $p_1$ of the nuisance parameter is high. Here we
discuss an approach which simplifies the computations but leads to a suboptimal
solution.

We start with the approach called plug-in. It is based on the assumption that
a pilot estimate $\hat\eta$ of the nuisance parameter $\eta$ is available. Then one obtains the
estimate $\hat\theta$ of the target $\theta^*$ from the residuals $Y - \Phi^\top\hat\eta$.

This means that the residual vector $\hat Y = Y - \Phi^\top\hat\eta$ is used as observations and
the estimate $\hat\theta$ is defined as the best fit to such observations in the $\theta$-model:

$$\hat\theta = \operatorname*{argmin}_{\theta} \|\hat Y - \Psi^\top\theta\|^2 = (\Psi\Psi^\top)^{-1}\Psi\hat Y . \tag{4.60}$$

A very particular case of the plug-in method is partial estimation from Sect. 4.9.3
with $\hat\eta = \eta^\circ$.

The plug-in method can be naturally described in the context of partial estimation.
We use the following representation of the plug-in method: $\hat\theta = \tilde\theta(\hat\eta)$.

Exercise 4.9.12. Check the identity $\hat\theta = \tilde\theta(\hat\eta)$ for the plug-in method. Describe the
plug-in estimate for $\hat\eta = 0$.

The behavior of $\hat\theta$ heavily depends upon the quality of the pilot $\hat\eta$. A detailed
study is complicated and a closed form solution is only available for the special case
of a linear pilot estimate. Let $\hat\eta = AY$. Then (4.60) implies

$$\hat\theta = (\Psi\Psi^\top)^{-1}\Psi(Y - \Phi^\top AY) = SY$$

with $S = (\Psi\Psi^\top)^{-1}\Psi(I_n - \Phi^\top A)$. This is a linear estimate whose properties can
be studied in the usual way.


4.9.8 Two-Step Procedure 


The ideas of partial and plug-in estimation can be combined yielding the so-called
two-step procedures. One starts with the initial guess $\theta^\circ$ for the target $\theta^*$. A very
special choice is $\theta^\circ = 0$. This leads to the partial $\eta$-model $Y(\theta^\circ) = \Phi^\top\eta + \varepsilon$
for the residuals $Y(\theta^\circ) = Y - \Psi^\top\theta^\circ$. Next compute the partial MLE $\tilde\eta(\theta^\circ) =
(\Phi\Phi^\top)^{-1}\Phi Y(\theta^\circ)$ in this model and use it as a pilot for the plug-in method:
compute the residuals

$$\hat Y(\theta^\circ) = Y - \Phi^\top\tilde\eta(\theta^\circ) = Y - \Pi_\eta Y(\theta^\circ)$$

with $\Pi_\eta = \Phi^\top(\Phi\Phi^\top)^{-1}\Phi$, and then estimate the target parameter $\theta$ by fitting $\Psi^\top\theta$
to the residuals $\hat Y(\theta^\circ)$. This method results in the estimate

$$\hat\theta(\theta^\circ) = (\Psi\Psi^\top)^{-1}\Psi\hat Y(\theta^\circ) . \tag{4.61}$$

A simple comparison with the formula (4.53) reveals that the pragmatic two-step
approach is sub-optimal: the resulting estimate does not coincide with the profile MLE $\tilde\theta$ unless
we have an orthogonal situation with $\Psi\Pi_\eta = 0$. In particular, the estimate $\hat\theta(\theta^\circ)$
from (4.61) is biased.
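The bias of the two-step estimate can be made visible without simulation: the estimate (4.61) is an affine function of $Y$, so evaluating it at the noise-free mean $EY$ returns its expectation. The sketch below does this for a generic and for an orthogonalized design. The data are hypothetical and the helper names are illustrative.

```python
import numpy as np

def projector(M):
    """Orthogonal projector onto the row space of M (full row rank assumed)."""
    return M.T @ np.linalg.solve(M @ M.T, M)

def two_step(Psi, Phi, Y, theta0):
    """Two-step estimate (4.61): pilot eta from the residuals Y - Psi^T theta0, then refit theta."""
    Y_hat = Y - projector(Phi) @ (Y - Psi.T @ theta0)
    return np.linalg.solve(Psi @ Psi.T, Psi @ Y_hat)

rng = np.random.default_rng(5)
n, p, p1 = 40, 2, 3
Phi = rng.normal(size=(p1, n))
Psi = rng.normal(size=(p, n))                        # generic, non-orthogonal design
Psi_orth = Psi @ (np.eye(n) - projector(Phi))        # design with Psi_orth Phi^T = 0
theta_star = np.array([1.0, -2.0])
eta_star = rng.normal(size=p1)
theta0 = np.zeros(p)                                 # the special choice theta° = 0

# Evaluate the (affine) estimate at the mean of Y to obtain its expectation.
mean_general = two_step(Psi, Phi, Psi.T @ theta_star + Phi.T @ eta_star, theta0)
mean_orth = two_step(Psi_orth, Phi, Psi_orth.T @ theta_star + Phi.T @ eta_star, theta0)
```

In the generic design `mean_general` differs from `theta_star`, while in the orthogonal design `mean_orth` reproduces it.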


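Exercise 4.9.13. Consider the orthogonal case with $\Psi\Phi^\top = 0$. Show that the two-step
estimate $\hat\theta(\theta^\circ)$ coincides with the partial MLE $\tilde\theta = (\Psi\Psi^\top)^{-1}\Psi Y$.

Exercise 4.9.14. Compute the mean of $\hat\theta(\theta^\circ)$. Show that there exists some $\theta^*$ such
that $E\{\hat\theta(\theta^\circ)\} \ne \theta^*$ unless the orthogonality condition $\Psi\Phi^\top = 0$ is fulfilled.

Exercise 4.9.15. Compute the variance of $\hat\theta(\theta^\circ)$.
Hint: use that $\operatorname{Var}\{Y(\theta^\circ)\} = \operatorname{Var}(Y) = \sigma^2 I_n$. Derive that $\operatorname{Var}\{\hat Y(\theta^\circ)\} = \sigma^2(I_n -
\Pi_\eta)$.

Exercise 4.9.16. Let $\Psi$ be orthogonal, i.e. $\Psi\Psi^\top = I_p$. Show that

$$\operatorname{Var}\{\hat\theta(\theta^\circ)\} = \sigma^2(I_p - \Psi\Pi_\eta\Psi^\top) .$$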
4.9.9 Alternating Method


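The idea of partial and two-step estimation can be applied in an iterative way. One
starts with some initial value $\theta^\circ$ and sequentially performs the two steps of
partial estimation. Set

$$\hat\eta_0 = \tilde\eta(\theta^\circ) = \operatorname*{argmin}_{\eta} \|Y - \Psi^\top\theta^\circ - \Phi^\top\eta\|^2 = (\Phi\Phi^\top)^{-1}\Phi(Y - \Psi^\top\theta^\circ) .$$

With this estimate fixed, compute $\hat\theta_1 = \tilde\theta(\hat\eta_0)$ and continue in this way. Generically,
with $\hat\theta_k$ and $\hat\eta_k$ computed, one recomputes

$$\hat\theta_{k+1} = \tilde\theta(\hat\eta_k) = (\Psi\Psi^\top)^{-1}\Psi(Y - \Phi^\top\hat\eta_k), \tag{4.62}$$

$$\hat\eta_{k+1} = \tilde\eta(\hat\theta_{k+1}) = (\Phi\Phi^\top)^{-1}\Phi(Y - \Psi^\top\hat\theta_{k+1}) . \tag{4.63}$$

The procedure is especially transparent if the partial design matrices $\Psi$ and $\Phi$ are
orthonormal: $\Psi\Psi^\top = I_p$, $\Phi\Phi^\top = I_{p_1}$. Then

$$\hat\theta_{k+1} = \Psi(Y - \Phi^\top\hat\eta_k),$$
$$\hat\eta_{k+1} = \Phi(Y - \Psi^\top\hat\theta_{k+1}) .$$

In other words, having an estimate $\hat\theta$ of the parameter $\theta^*$ one computes the residuals
$\hat Y = Y - \Psi^\top\hat\theta$ and then builds the estimate $\hat\eta$ of the nuisance $\eta^*$ by the empirical
coefficients $\Phi\hat Y$. Then this estimate $\hat\eta$ is used in a similar way to recompute the
estimate of $\theta^*$, and so on.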

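It is worth noting that every doubled step of alternation improves the current
value $L(\hat\theta_k, \hat\eta_k)$. Indeed, $\hat\theta_{k+1}$ is defined by maximizing $L(\theta, \hat\eta_k)$ over $\theta$, that is,
$L(\hat\theta_{k+1}, \hat\eta_k) \ge L(\hat\theta_k, \hat\eta_k)$. Similarly, $L(\hat\theta_{k+1}, \hat\eta_{k+1}) \ge L(\hat\theta_{k+1}, \hat\eta_k)$, yielding

$$L(\hat\theta_{k+1}, \hat\eta_{k+1}) \ge L(\hat\theta_k, \hat\eta_k). \tag{4.64}$$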
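The alternating steps (4.62), (4.63) can be sketched in a few lines for the simplest case of a scalar target and a scalar nuisance. This is only an illustration under stated assumptions: the vectors `psi`, `phi` and the data `y` are hypothetical, and since for Gaussian errors the log-likelihood equals minus the residual sum of squares up to scaling and an additive constant, the monotonicity (4.64) shows up as a non-increasing residual sum of squares.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alternate(y, psi, phi, theta0=0.0, steps=50):
    """Alternating least squares for y ~ theta*psi + eta*phi with scalar theta, eta."""
    theta = theta0
    # initial nuisance estimate computed from the initial guess theta0
    eta = dot(phi, [yi - theta * p for yi, p in zip(y, psi)]) / dot(phi, phi)
    rss = []
    for _ in range(steps):
        # step (4.62): refit theta with eta fixed
        theta = dot(psi, [yi - eta * f for yi, f in zip(y, phi)]) / dot(psi, psi)
        # step (4.63): refit eta with the new theta fixed
        eta = dot(phi, [yi - theta * p for yi, p in zip(y, psi)]) / dot(phi, phi)
        rss.append(sum((yi - theta * p - eta * f) ** 2
                       for yi, p, f in zip(y, psi, phi)))
    return theta, eta, rss

# hypothetical data: intercept (psi) plus linear trend (phi), non-orthogonal design
psi = [1.0, 1.0, 1.0, 1.0]
phi = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.1, 4.9, 7.2]
theta_hat, eta_hat, rss = alternate(y, psi, phi)

# joint least squares solution from the 2x2 normal equations, for comparison
a, b, c = dot(psi, psi), dot(psi, phi), dot(phi, phi)
det = a * c - b * b
theta_ls = (c * dot(psi, y) - b * dot(phi, y)) / det
eta_ls = (a * dot(phi, y) - b * dot(psi, y)) / det
```

In this example the error of `theta_hat` shrinks by the factor $(\psi^\top\phi)^2 / (\|\psi\|^2 \|\phi\|^2) = 36/56$ per doubled step, so fifty steps are ample for agreement with the joint least squares fit.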


A very interesting question is whether the procedure (4.62), (4.63) converges and 
whether it converges to the maximum likelihood solution. The answer is positive and 
in the simplest orthogonal case the result is straightforward. 


Exercise 4.9.17. Consider the orthogonal situation with $\Psi\Phi^\top = 0$. Then the above
procedure stabilizes in one step with the solution from Theorem 4.9.4.


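In the non-orthogonal case the situation is much more complicated. The idea is
to show that the alternating procedure can be represented as a sequence of actions of
a shrinking linear operator on the data. The key observation behind the result is the
following recurrent formula for $\Psi^\top \hat\theta_k$ and $\Phi^\top \hat\eta_k$:

$$\Psi^\top \hat\theta_{k+1} = \Pi_\theta (Y - \Phi^\top \hat\eta_k) = (\Pi_\theta - \Pi_\theta \Pi_\eta) Y + \Pi_\theta \Pi_\eta \Psi^\top \hat\theta_k, \tag{4.65}$$

$$\Phi^\top \hat\eta_{k+1} = \Pi_\eta (Y - \Psi^\top \hat\theta_{k+1}) = (\Pi_\eta - \Pi_\eta \Pi_\theta) Y + \Pi_\eta \Pi_\theta \Phi^\top \hat\eta_k, \tag{4.66}$$

with $\Pi_\theta = \Psi^\top (\Psi\Psi^\top)^{-1} \Psi$ and $\Pi_\eta = \Phi^\top (\Phi\Phi^\top)^{-1} \Phi$.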

Exercise 4.9.18. Show (4.65) and (4.66). 

This representation explains necessary and sufficient conditions for convergence
of the alternating procedure. Namely, the spectral norm $\|\Pi_\eta \Pi_\theta\|_\infty$ (the largest
singular value) of the product operator $\Pi_\eta \Pi_\theta$ should be strictly less than one, and
similarly for $\Pi_\theta \Pi_\eta$.


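Exercise 4.9.19. Show that $\|\Pi_\theta \Pi_\eta\|_\infty = \|\Pi_\eta \Pi_\theta\|_\infty$.


Theorem 4.9.13. Suppose that $\|\Pi_\eta \Pi_\theta\|_\infty = \lambda < 1$. Then the alternating
procedure converges geometrically, the limiting values $\hat\theta$ and $\hat\eta$ are unique and fulfill

$$\Psi^\top \hat\theta = (I_n - \Pi_\theta \Pi_\eta)^{-1} (\Pi_\theta - \Pi_\theta \Pi_\eta) Y,$$
$$\Phi^\top \hat\eta = (I_n - \Pi_\eta \Pi_\theta)^{-1} (\Pi_\eta - \Pi_\eta \Pi_\theta) Y, \tag{4.67}$$

and $\hat\theta$ coincides with the profile MLE $\tilde\theta$ from (4.52).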


Proof. The convergence will be discussed below. Now we comment on the identity
$\hat\theta = \tilde\theta$. A direct comparison of the formulas for these two estimates can be a
hard task. Instead we use the monotonicity property (4.64). By definition, $(\tilde\theta, \tilde\eta)$
maximizes $L(\theta, \eta)$ globally. If we start the procedure with $\theta^\circ = \tilde\theta$, every step
could only improve the value $L(\tilde\theta, \tilde\eta)$, which is already maximal. By uniqueness, the procedure stabilizes
with $\hat\theta_k = \tilde\theta$ and $\hat\eta_k = \tilde\eta$ for every $k$.


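Exercise 4.9.20. 1. Show by induction arguments that

$$\Phi^\top \hat\eta_{k+1} = A_{k+1} Y + (\Pi_\eta \Pi_\theta)^{k}\, \Phi^\top \hat\eta_1,$$

where the linear operator $A_k$ fulfills $A_1 = 0$ and

$$A_{k+1} = \Pi_\eta - \Pi_\eta \Pi_\theta + \Pi_\eta \Pi_\theta A_k = \sum_{i=0}^{k-1} (\Pi_\eta \Pi_\theta)^i (\Pi_\eta - \Pi_\eta \Pi_\theta).$$

2. Show that $A_k$ converges to $A = (I_n - \Pi_\eta \Pi_\theta)^{-1} (\Pi_\eta - \Pi_\eta \Pi_\theta)$ and evaluate
$\|A - A_k\|_\infty$ and $\|\Phi^\top (\hat\eta_k - \hat\eta)\|$.
Hint: use that $\|\Pi_\eta - \Pi_\eta \Pi_\theta\|_\infty \le 1$ and $\|(\Pi_\eta \Pi_\theta)^i\|_\infty \le \|\Pi_\eta \Pi_\theta\|_\infty^i \le \lambda^i$.

3. Prove (4.67) by inserting $\hat\eta$ in place of $\hat\eta_k$ and $\hat\eta_{k+1}$ in (4.66).


4.10 Historical Remarks and Further Reading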


The least squares method goes back to Carl Gauss (around 1795, but first published in
1809) and Adrien Marie Legendre in 1805.

The notion of linear regression was introduced by Francis Galton around 1868
for a biological problem and then extended by Karl Pearson and Ronald Fisher
between 1912 and 1922.

The chi-squared distribution was first described by the German statistician Friedrich
Robert Helmert in papers of 1875/1876. The distribution was independently
rediscovered by Karl Pearson in the context of goodness of fit.

The Gauss-Markov theorem is attributed to Gauß (1995) (originally published in
Latin in 1821/1823) and Markoff (1912).

Classical references for the sandwich formula (see, e.g., (4.12)) for the variance 
of the maximum likelihood estimator in a misspecified model are Huber (1967) and 
White (1982). 

It seems that the term ridge regression has first been used by Hoerl (1962); see 
also the original paper by Tikhonov (1963) for what is nowadays referred to as 
Tikhonov regularization. An early reference on penalized maximum likelihood is 
Good and Gaskins (1971) who discussed the usage of a roughness penalty. 

The original reference for the Galerkin method is Galerkin (1915). Its application 
in the context of regularization of inverse problems has been described, e.g., by 
Donoho (1995). The theory of shrinkage estimation started with the fundamental 
article by Stein (1956). 

A systematic treatment of profile maximum likelihood is provided by Murphy 
and van der Vaart (2000), but the origins of this concept can be traced back to Fisher 
(1956). 

The alternating procedure has been introduced by Dempster et al. (1977) in the 
form of the expectation-maximization algorithm. 


Chapter 5 
Bayes Estimation 


This chapter discusses the Bayes approach to parameter estimation. This approach
differs essentially from classical parametric modeling, also called the frequentist
approach. Classical frequentist modeling assumes that the observed data $Y$ follow a
distribution law $\mathbb{P}$ from a given parametric family $(\mathbb{P}_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$, that is,

$$\mathbb{P} = \mathbb{P}_{\theta^*} \in (\mathbb{P}_\theta).$$

Suppose that the family $(\mathbb{P}_\theta)$ is dominated by a measure $\mu_0$ and denote by $p(y \mid \theta)$
the corresponding density:

$$p(y \mid \theta) = \frac{d\mathbb{P}_\theta}{d\mu_0}(y).$$


The likelihood is defined as the density at the observed point and the maximum
likelihood approach tries to recover the true parameter $\theta^*$ by maximizing this
likelihood over $\theta \in \Theta$.

In the Bayes approach, the paradigm is changed and the true data distribution is
not assumed to be specified by a single parameter value $\theta^*$. Instead, the unknown
parameter is considered to be a random variable $\vartheta$ with a distribution $\pi$ on the
parameter space $\Theta$ called a prior. The measure $\mathbb{P}_\theta$ can be considered to be the
data distribution conditioned on the event that the randomly selected parameter is exactly $\theta$. The
target of analysis is not a single value $\theta^*$; this value is no longer defined. Instead
one is interested in the posterior distribution of the random parameter $\vartheta$ given the
observed data:


what is the distribution of $\vartheta$ given the prior $\pi$ and the data $Y$?


In other words, one aims at inferring on the distribution of $\vartheta$ on the basis of the
observed data $Y$ and our prior knowledge $\pi$. Below we distinguish between the
random variable $\vartheta$ and its particular values $\theta$. However, one often uses the same
symbol $\theta$ for denoting both objects.


5.1 Bayes Formula


The Bayes modeling assumptions can be put together in the form

$$Y \mid \theta \sim p(\cdot \mid \theta),$$
$$\vartheta \sim \pi(\cdot).$$

The first line has to be understood as the conditional distribution of $Y$ given the
particular value $\theta$ of the random parameter $\vartheta$: $Y \mid \theta$ means $Y \mid \vartheta = \theta$. This section
formalizes and states the Bayes approach in a formal mathematical way. The answer
is given by the Bayes formula for the conditional distribution of $\vartheta$ given $Y$. First
consider the joint distribution $\mathbb{P}$ of $Y$ and $\vartheta$. If $B$ is a Borel set in the space $\mathcal{Y}$ of
observations and $A$ is a measurable subset of $\Theta$, then

$$\mathbb{P}(B \times A) = \int_A \Bigl( \int_B \mathbb{P}_\theta(dy) \Bigr) \pi(d\theta).$$


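The marginal or unconditional distribution of $Y$ is given by averaging the joint
probability w.r.t. the distribution of $\vartheta$:

$$\mathbb{P}(B) = \int_\Theta \int_B \mathbb{P}_\theta(dy)\, \pi(d\theta) = \int_\Theta \mathbb{P}_\theta(B)\, \pi(d\theta).$$

The posterior (conditional) distribution of $\vartheta$ given the event $Y \in B$ is defined as
the ratio of the joint and marginal probabilities:

$$\mathbb{P}(\vartheta \in A \mid Y \in B) = \frac{\mathbb{P}(B \times A)}{\mathbb{P}(B)}.$$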


Equivalently one can write this formula in terms of the related densities. In what
follows we denote by the same letter $\pi$ the prior measure $\pi$ and its density w.r.t.
some other measure $\lambda$, e.g. the Lebesgue or uniform measure on $\Theta$. Then the joint
measure $\mathbb{P}$ has the density

$$p(y, \theta) = p(y \mid \theta)\, \pi(\theta),$$

while the marginal density $p(y)$ is the integral of the joint density w.r.t. the prior $\pi$:

$$p(y) = \int_\Theta p(y, \theta)\, d\theta = \int_\Theta p(y \mid \theta)\, \pi(\theta)\, d\theta.$$


Finally the posterior (conditional) density $p(\theta \mid y)$ of $\vartheta$ given $y$ is defined as the
ratio of the joint density $p(y, \theta)$ and the marginal density $p(y)$:

$$p(\theta \mid y) = \frac{p(y, \theta)}{p(y)} = \frac{p(y \mid \theta)\, \pi(\theta)}{\int_\Theta p(y \mid \theta)\, \pi(\theta)\, d\theta}.$$


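Our definitions are summarized in the next lines:

$$Y \mid \theta \sim p(y \mid \theta),$$
$$\vartheta \sim \pi(\theta),$$
$$Y \sim p(y) = \int_\Theta p(y \mid \theta)\, \pi(\theta)\, d\theta,$$
$$\vartheta \mid Y \sim p(\theta \mid Y) = \frac{p(Y, \theta)}{p(Y)} = \frac{p(Y \mid \theta)\, \pi(\theta)}{\int_\Theta p(Y \mid \theta)\, \pi(\theta)\, d\theta}. \tag{5.1}$$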


Note that given the prior $\pi$ and the observations $Y$, the posterior density $p(\theta \mid Y)$ is
uniquely defined and can be viewed as the solution or target of analysis within the
Bayes approach. The expression (5.1) for the posterior density is called the Bayes
formula.

The value $p(y)$ of the marginal density of $Y$ at $y$ does not depend on the
parameter $\theta$. Given the data $Y$, it is just a numeric normalizing factor. Often one
skips this factor, writing

$$\vartheta \mid Y \propto p(Y \mid \theta)\, \pi(\theta).$$


Below we consider a couple of examples. 


Example 5.1.1. Let $Y = (Y_1, \ldots, Y_n)^\top$ be a sequence of zeros and ones considered
to be a realization of a Bernoulli experiment for $n = 10$. Let also the underlying
parameter $\theta$ be random and let it take the values 1/2 or 1, each with probability 1/2,
that is,

$$\pi(1/2) = \pi(1) = 1/2.$$

Then the probability of observing $y$ = "10 ones" is

$$\mathbb{P}(y) = \frac{1}{2}\, \mathbb{P}(y \mid \vartheta = 1/2) + \frac{1}{2}\, \mathbb{P}(y \mid \vartheta = 1).$$

The first probability is quite small, it is $2^{-10}$, while the second one is just one.
Therefore, $\mathbb{P}(y) = (2^{-10} + 1)/2$. If we observed $y = (1, \ldots, 1)^\top$, then the posterior
probability of $\vartheta = 1$ is

$$\mathbb{P}(\vartheta = 1 \mid y) = \frac{\mathbb{P}(y \mid \vartheta = 1)\, \mathbb{P}(\vartheta = 1)}{\mathbb{P}(y)} = \frac{1}{2^{-10} + 1},$$

that is, it is quite close to one.


Exercise 5.1.1. Consider the Bernoulli experiment $Y = (Y_1, \ldots, Y_n)^\top$ with $n = 10$
and let

$$\pi(1/2) = \pi(0.9) = 1/2.$$

Compute the posterior distribution of $\vartheta$ if we observe $y = (y_1, \ldots, y_n)^\top$ with

* $y = (1, \ldots, 1)^\top$;
* the number of successes $S = y_1 + \ldots + y_n$ equal to 5.

Show that the posterior density $p(\theta \mid y)$ only depends on the number of successes $S$.
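The arithmetic of Example 5.1.1 can be reproduced with exact rational numbers. The sketch below is only an illustration; the helper name `posterior` and the dictionary representation of a finite prior are assumptions of this sketch, not notation from the text. It uses the fact that a particular 0/1 sequence with $s$ ones among $n$ trials has probability $\theta^s (1-\theta)^{n-s}$.

```python
from fractions import Fraction

def posterior(prior, s, n):
    """Posterior over a finite set of Bernoulli parameters.

    prior maps each candidate theta to its prior probability;
    s is the number of ones in the observed sequence of length n.
    """
    joint = {th: w * th ** s * (1 - th) ** (n - s) for th, w in prior.items()}
    marginal = sum(joint.values())          # P(y), the normalizing factor
    return {th: v / marginal for th, v in joint.items()}

# prior of Example 5.1.1: theta is 1/2 or 1, each with probability 1/2
prior = {Fraction(1, 2): Fraction(1, 2), Fraction(1): Fraction(1, 2)}
post = posterior(prior, s=10, n=10)
```

Here `post[Fraction(1)]` equals $1/(2^{-10} + 1) = 1024/1025$, in agreement with the example. The same function applies to any other finite prior and any number of successes.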


5.2 Conjugated Priors 


Let $(\mathbb{P}_\theta)$ be a dominated parametric family with the density function $p(y \mid \theta)$. For a
prior $\pi$ with the density $\pi(\theta)$, the posterior density is proportional to $p(y \mid \theta)\, \pi(\theta)$.
Now consider the case when the prior $\pi$ belongs to some other parametric family
indexed by a parameter $\alpha$, that is, $\pi(\theta) = \pi(\theta, \alpha)$. A very desirable feature of
the Bayes approach is that the posterior density also belongs to this family. Then
computing the posterior is equivalent to fixing the related parameter $\alpha = \alpha(Y)$.
Such priors are usually called conjugated.

To illustrate this notion, we present some examples.


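Example 5.2.1 (Gaussian Shift). Let $Y_i \sim N(\theta, \sigma^2)$ with $\sigma$ known, $i = 1, \ldots, n$.
Consider $\vartheta \sim N(\tau, g^2)$, $\alpha = (\tau, g^2)$. Then for $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$

$$p(y \mid \theta)\, \pi(\theta, \alpha) = C \exp\Bigl\{ -\sum_{i} (y_i - \theta)^2 / (2\sigma^2) - (\theta - \tau)^2 / (2 g^2) \Bigr\},$$

where the normalizing factor $C$ does not depend on $\theta$ and $y$. The expression in the
exponent is a quadratic form of $\theta$ and the Taylor expansion w.r.t. $\theta$ at $\theta = \tau$ implies

$$\vartheta \mid Y \propto \exp\Bigl\{ -\sum_{i} \frac{(y_i - \tau)^2}{2\sigma^2} + \frac{\theta - \tau}{\sigma^2} \sum_{i} (y_i - \tau) - \frac{\sigma^{-2} n + g^{-2}}{2} (\theta - \tau)^2 \Bigr\}.$$

This representation indicates that the conditional distribution of $\vartheta$ given $Y$ is normal.
The parameters of the posterior will be computed in the next section.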


Example 5.2.2 (Bernoulli). Let $Y_i$ be a Bernoulli r.v. with $\mathbb{P}(Y_i = 1) = \theta$. Then for
$y = (y_1, \ldots, y_n)^\top$, it holds $p(y \mid \theta) = \prod_{i=1}^n \theta^{y_i} (1 - \theta)^{1 - y_i}$. Consider the family
of priors from the Beta-distribution: $\pi(\theta, \alpha) = \pi(\alpha)\, \theta^a (1 - \theta)^b$ for $\alpha = (a, b)$,
where $\pi(\alpha)$ is a normalizing constant. It follows

$$\vartheta \mid y \propto \theta^a (1 - \theta)^b \prod_{i=1}^n \theta^{y_i} (1 - \theta)^{1 - y_i} = \theta^{s + a} (1 - \theta)^{n - s + b}$$

for $s = y_1 + \ldots + y_n$. Obviously given $s$ this is again a distribution from the
Beta-family.
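The conjugate update of Example 5.2.2 amounts to adding counts to the prior exponents. A minimal sketch, assuming the parametrization of the example (prior density proportional to $\theta^a (1-\theta)^b$, which is the Beta$(a+1, b+1)$ law in the usual convention); the function names and the data are hypothetical.

```python
def beta_update(a, b, y):
    """Posterior exponents for a prior proportional to theta^a (1-theta)^b.

    y is a sequence of 0/1 observations; s successes raise the first
    exponent and n - s failures raise the second one.
    """
    s = sum(y)
    return a + s, b + len(y) - s

def beta_mean(a, b):
    # theta^a (1-theta)^b is the Beta(a+1, b+1) density up to a constant,
    # and the mean of Beta(a+1, b+1) is (a+1) / (a+b+2)
    return (a + 1) / (a + b + 2)

y = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]      # hypothetical sample: s = 7, n = 10
a_post, b_post = beta_update(1, 1, y)
post_mean = beta_mean(a_post, b_post)
```

With the prior exponents $a = b = 1$ the posterior exponents are $(8, 4)$ and the posterior mean is $9/14$, a value between the prior mean $1/2$ and the empirical frequency $7/10$.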


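Example 5.2.3 (Exponential). Let $Y_i$ be an exponential r.v. with $\mathbb{P}(Y_i > y) = e^{-y\theta}$.
Then $p(y \mid \theta) = \prod_{i=1}^n \theta e^{-y_i \theta}$. For the family of priors from the Gamma-distribution
with $\pi(\theta, \alpha) = C(\alpha)\, \theta^a e^{-\theta b}$ for $\alpha = (a, b)$, one has for the vector
$y = (y_1, \ldots, y_n)^\top$

$$\vartheta \mid y \propto \theta^a e^{-\theta b} \prod_{i=1}^n \theta e^{-y_i \theta} = \theta^{n + a} e^{-\theta (s + b)}$$

with $s = y_1 + \ldots + y_n$, which yields that the posterior is Gamma with the parameters $n + a$ and $s + b$.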
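The Gamma update of Example 5.2.3 can be checked numerically by comparing the closed-form posterior mean with a brute-force integration of prior times likelihood. This is a sketch under stated assumptions: the data and prior parameters are hypothetical, the integration uses a plain midpoint rule on a truncated interval, and the mean formula relies on the fact that a density proportional to $\theta^{a} e^{-\theta b}$ is Gamma with shape $a + 1$ and rate $b$.

```python
import math

def gamma_update(a, b, y):
    """Posterior parameters for a prior proportional to theta^a exp(-theta*b)."""
    return a + len(y), b + sum(y)

def numeric_posterior_mean(a, b, y, upper=20.0, grid=20000):
    """Posterior mean by midpoint integration of theta * prior * likelihood."""
    h = upper / grid
    n, s = len(y), sum(y)
    num = den = 0.0
    for i in range(grid):
        th = (i + 0.5) * h
        w = th ** a * math.exp(-th * b) * th ** n * math.exp(-th * s)
        num += th * w
        den += w
    return num / den

y = [0.5, 1.2, 0.3, 2.0]                 # hypothetical sample: n = 4, s = 4.0
a_post, b_post = gamma_update(2, 1.0, y)
closed_form_mean = (a_post + 1) / b_post  # mean of Gamma(shape a_post + 1, rate b_post)
numeric_mean = numeric_posterior_mean(2, 1.0, y)
```

Both computations give a posterior mean of about 1.4 for this sample.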


All the previous examples can be systematically treated as special cases of an
exponential family. Let $Y = (Y_1, \ldots, Y_n)$ be an i.i.d. sample from a univariate EF
$(\mathbb{P}_\theta)$ with the density

$$p_1(y \mid \theta) = p_1(y) \exp\{ y\, C(\theta) - B(\theta) \}$$

for some fixed functions $C(\theta)$ and $B(\theta)$. For $y = (y_1, \ldots, y_n)^\top$, the joint density
at the point $y$ is given by

$$p(y \mid \theta) \propto \exp\{ s\, C(\theta) - n B(\theta) \}, \qquad s = y_1 + \ldots + y_n.$$

This suggests to take a prior from the family $\pi(\theta, \alpha) = \pi(\alpha) \exp\{ a\, C(\theta) - b B(\theta) \}$
with $\alpha = (a, b)^\top$. This yields for the posterior density

$$\vartheta \mid y \propto p(y \mid \theta)\, \pi(\theta, \alpha) \propto \exp\{ (s + a) C(\theta) - (n + b) B(\theta) \},$$

which is from the same family with the new parameters $\alpha(Y) = (S + a, n + b)$.

Exercise 5.2.1. Build a conjugate prior for the Poisson family.


Exercise 5.2.2. Build a conjugate prior for the variance of the normal family with
mean zero and unknown variance.


5.3 Linear Gaussian Model and Gaussian Priors


An interesting and important class of prior distributions is given by Gaussian priors.
The very nice and desirable feature of this class is that the posterior distribution for
the Gaussian model and Gaussian prior is also Gaussian.


5.3.1 Univariate Case


We start with the case of a univariate parameter and one observation $Y \sim N(\theta, \sigma^2)$,
where the variance $\sigma^2$ is known and only the mean $\theta$ is unknown. The Bayes
approach suggests to treat $\theta$ as a random variable. Suppose that the prior $\pi$ is also
normal with mean $\tau$ and variance $r^2$.


Theorem 5.3.1. Let $Y \sim N(\theta, \sigma^2)$, and let the prior $\pi$ be the normal distribution
$N(\tau, r^2)$:

$$Y \mid \theta \sim N(\theta, \sigma^2),$$
$$\vartheta \sim N(\tau, r^2).$$

Then the joint, marginal, and posterior distributions are normal as well. Moreover,
it holds

$$Y \sim N(\tau, \sigma^2 + r^2),$$
$$\vartheta \mid Y \sim N\Bigl( \frac{\tau \sigma^2 + Y r^2}{\sigma^2 + r^2},\ \frac{\sigma^2 r^2}{\sigma^2 + r^2} \Bigr).$$


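Proof. It holds $Y = \vartheta + \varepsilon$ with $\vartheta \sim N(\tau, r^2)$ and $\varepsilon \sim N(0, \sigma^2)$ independent of $\vartheta$.
Therefore, $Y$ is normal with mean $\mathbb{E} Y = \mathbb{E}\vartheta + \mathbb{E}\varepsilon = \tau$ and the variance is

$$\operatorname{Var}(Y) = \mathbb{E}(Y - \tau)^2 = r^2 + \sigma^2.$$

This implies the formula for the marginal density $p(Y)$. Next, for $\rho = \sigma^2 / (r^2 + \sigma^2)$,

$$\mathbb{E}[(\vartheta - \tau)(Y - \tau)] = \mathbb{E}(\vartheta - \tau)^2 = r^2 = (1 - \rho) \operatorname{Var}(Y).$$

Thus, the random variables $Y - \tau$ and $\zeta$ with

$$\zeta = \vartheta - \tau - (1 - \rho)(Y - \tau) = \rho(\vartheta - \tau) - (1 - \rho)\varepsilon$$

are Gaussian and uncorrelated and therefore, independent. The conditional
distribution of $\zeta$ given $Y$ coincides with the unconditional distribution and hence, it is
normal with mean zero and variance

$$\operatorname{Var}(\zeta) = \rho^2 \operatorname{Var}(\vartheta) + (1 - \rho)^2 \operatorname{Var}(\varepsilon) = \rho^2 r^2 + (1 - \rho)^2 \sigma^2 = \frac{\sigma^2 r^2}{\sigma^2 + r^2}.$$

This yields the result because $\vartheta = \zeta + \rho\tau + (1 - \rho) Y$.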
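The statement of Theorem 5.3.1 can also be checked numerically: one evaluates prior times likelihood on a fine grid of parameter values, normalizes, and compares the first two moments with the closed-form expressions. The sketch below does this for one hypothetical set of values of $Y$, $\sigma^2$, $\tau$, $r^2$; the grid bounds and step are assumptions chosen so that the truncation error is negligible for these values.

```python
import math

def normal_posterior(y, sigma2, tau, r2):
    """Closed-form posterior mean and variance from Theorem 5.3.1."""
    mean = (tau * sigma2 + y * r2) / (sigma2 + r2)
    var = sigma2 * r2 / (sigma2 + r2)
    return mean, var

def grid_posterior(y, sigma2, tau, r2, lo=-20.0, hi=20.0, m=40000):
    """Posterior mean and variance via the Bayes formula on a midpoint grid."""
    h = (hi - lo) / m
    w0 = w1 = w2 = 0.0
    for i in range(m):
        th = lo + (i + 0.5) * h
        # unnormalized posterior density: likelihood times prior
        w = math.exp(-(y - th) ** 2 / (2 * sigma2) - (th - tau) ** 2 / (2 * r2))
        w0 += w
        w1 += th * w
        w2 += th * th * w
    mean = w1 / w0
    return mean, w2 / w0 - mean ** 2

exact = normal_posterior(y=2.0, sigma2=1.0, tau=0.0, r2=4.0)
approx = grid_posterior(y=2.0, sigma2=1.0, tau=0.0, r2=4.0)
```

For these values the posterior mean is 1.6 and the posterior variance is 0.8: the observation $Y = 2$ is shrunk toward the prior mean 0 by the factor $1 - \rho = 0.8$.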


Exercise 5.3.1. Check the result of Theorem 5.3.1 by direct calculation using
Bayes formula (5.1).


Now consider the i.i.d. model from $N(\theta, \sigma^2)$ where the variance $\sigma^2$ is known.

Theorem 5.3.2. Let $Y = (Y_1, \ldots, Y_n)^\top$ be i.i.d. and for each $Y_i$

$$Y_i \mid \theta \sim N(\theta, \sigma^2), \tag{5.2}$$
$$\vartheta \sim N(\tau, r^2). \tag{5.3}$$

Then for $\bar Y = (Y_1 + \ldots + Y_n)/n$

$$\vartheta \mid Y \sim N\Bigl( \frac{\tau \sigma^2/n + \bar Y r^2}{r^2 + \sigma^2/n},\ \frac{r^2 \sigma^2/n}{r^2 + \sigma^2/n} \Bigr).$$


So, the posterior mean of $\vartheta$ is a weighted average of the prior mean $\tau$ and the
sample estimate $\bar Y$; the sample estimate is pulled back (or shrunk) toward the prior
mean. Moreover, the weight $\rho = (\sigma^2/n) / (r^2 + \sigma^2/n)$ on the prior mean is close to one if $\sigma^2/n$ is large relative
to $r^2$ (i.e., our prior knowledge is more precise than the data information), producing
substantial shrinkage. If $\sigma^2/n$ is small (i.e., our prior knowledge is imprecise relative
to the data information), $\rho$ is close to zero and the direct estimate $\bar Y$ is moved very
little towards the prior mean.
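The dependence of the shrinkage weight on the sample size is easy to tabulate. A small sketch with hypothetical numbers, using the posterior parameters of Theorem 5.3.2 and the weight $\rho = (\sigma^2/n)/(r^2 + \sigma^2/n)$ on the prior mean:

```python
def shrinkage(ybar, n, sigma2, tau, r2):
    """Weight on the prior mean, posterior mean and variance (Theorem 5.3.2)."""
    rho = (sigma2 / n) / (r2 + sigma2 / n)
    mean = rho * tau + (1 - rho) * ybar
    var = (r2 * sigma2 / n) / (r2 + sigma2 / n)
    return rho, mean, var

# hypothetical setting: sample mean 2.0, prior N(0, 1), unit error variance
for n in (1, 10, 100):
    rho, mean, var = shrinkage(ybar=2.0, n=n, sigma2=1.0, tau=0.0, r2=1.0)
    print(f"n={n:4d}  rho={rho:.4f}  posterior mean={mean:.4f}  variance={var:.4f}")

rho_1, mean_1, var_1 = shrinkage(2.0, 1, 1.0, 0.0, 1.0)
rho_100, mean_100, var_100 = shrinkage(2.0, 100, 1.0, 0.0, 1.0)
```

With one observation the prior and the data receive equal weight, while with a hundred observations the weight on the prior mean drops to about 0.01 and the posterior mean is close to the sample mean.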


Exercise 5.3.2. Prove Theorem 5.3.2 using the technique of the proof of
Theorem 5.3.1.

Hint: consider $Y_i = \vartheta + \varepsilon_i$, $\bar Y = S/n$, and define $\zeta = \vartheta - \tau - (1 - \rho)(\bar Y - \tau)$.
Check that $\zeta$ and $\bar Y$ are uncorrelated and hence, independent.


The result of Theorem 5.3.2 can formally be derived from Theorem 5.3.1
by replacing $n$ i.i.d. observations $Y_1, \ldots, Y_n$ with one single observation $\bar Y$ with
conditional mean $\theta$ and variance $\sigma^2/n$.


5.3.2 Linear Gaussian Model and Gaussian Prior 


Now we consider the general case when both $Y$ and $\vartheta$ are vectors. Namely we
consider the linear model $Y = \Psi^\top \vartheta + \varepsilon$ with Gaussian errors $\varepsilon$ in which the
random parameter vector $\vartheta$ is multivariate normal as well:

$$\vartheta \sim N(\tau, \Gamma), \qquad Y \mid \theta \sim N(\Psi^\top \theta, \Sigma). \tag{5.4}$$

Here $\Psi$ is a given $p \times n$ design matrix, and $\Sigma$ is a given error covariance matrix.
Below we assume that both $\Sigma$ and $\Gamma$ are non-degenerate. The model (5.4) can be
represented in the form

$$\vartheta = \tau + \xi, \qquad \xi \sim N(0, \Gamma), \tag{5.5}$$
$$Y = \Psi^\top \tau + \Psi^\top \xi + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma), \qquad \varepsilon \perp \xi, \tag{5.6}$$

where $\xi \perp \varepsilon$ means independence of the error vectors $\xi$ and $\varepsilon$. This representation
makes clear that the vectors $\vartheta$, $Y$ are jointly normal. Now we state the result about
the conditional distribution of $\vartheta$ given $Y$.


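Theorem 5.3.3. Assume (5.4). Then the joint distribution of $\vartheta$, $Y$ is normal with

$$\mathbb{E}\begin{pmatrix} \vartheta \\ Y \end{pmatrix} = \begin{pmatrix} \tau \\ \Psi^\top \tau \end{pmatrix}, \qquad \operatorname{Var}\begin{pmatrix} \vartheta \\ Y \end{pmatrix} = \begin{pmatrix} \Gamma & \Gamma\Psi \\ \Psi^\top \Gamma & \Psi^\top \Gamma \Psi + \Sigma \end{pmatrix}.$$

Moreover, the posterior $\vartheta \mid Y$ is also normal. With $B = \Gamma^{-1} + \Psi \Sigma^{-1} \Psi^\top$,

$$\mathbb{E}(\vartheta \mid Y) = \tau + \Gamma\Psi (\Psi^\top \Gamma \Psi + \Sigma)^{-1} (Y - \Psi^\top \tau) = B^{-1} \Gamma^{-1} \tau + B^{-1} \Psi \Sigma^{-1} Y, \tag{5.7}$$

$$\operatorname{Var}(\vartheta \mid Y) = B^{-1}. \tag{5.8}$$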

Proof. The following technical lemma explains a very important property of the
normal law: conditional normal is again normal.


Lemma 5.3.1. Let $\xi$ and $\eta$ be jointly normal. Denote $U = \operatorname{Var}(\xi)$, $W = \operatorname{Var}(\eta)$,
$C = \operatorname{Cov}(\xi, \eta) = \mathbb{E}(\xi - \mathbb{E}\xi)(\eta - \mathbb{E}\eta)^\top$. Then the conditional distribution of $\xi$
given $\eta$ is also normal with

$$\mathbb{E}[\xi \mid \eta] = \mathbb{E}\xi + C W^{-1} (\eta - \mathbb{E}\eta),$$
$$\operatorname{Var}[\xi \mid \eta] = U - C W^{-1} C^\top.$$

Proof. First consider the case when $\xi$ and $\eta$ are zero-mean. Then the vector

$$\zeta \stackrel{\text{def}}{=} \xi - C W^{-1} \eta$$

is also normal zero mean and fulfills

$$\mathbb{E}(\zeta \eta^\top) = \mathbb{E}[(\xi - C W^{-1} \eta)\eta^\top] = \mathbb{E}(\xi \eta^\top) - C W^{-1} \mathbb{E}(\eta \eta^\top) = 0,$$
$$\operatorname{Var}(\zeta) = \mathbb{E}[(\xi - C W^{-1} \eta)(\xi - C W^{-1} \eta)^\top] = \mathbb{E}[(\xi - C W^{-1} \eta)\xi^\top] = U - C W^{-1} C^\top.$$

The vectors $\zeta$ and $\eta$ are jointly normal and uncorrelated, thus, independent. This
means that the conditional distribution of $\zeta$ given $\eta$ coincides with the unconditional
one. It remains to note that $\xi = \zeta + C W^{-1} \eta$. Hence, conditionally on $\eta$, the
vector $\xi$ is just a shift of the normal vector $\zeta$ by a fixed vector $C W^{-1} \eta$. Therefore, the
conditional distribution of $\xi$ given $\eta$ is normal with mean $C W^{-1} \eta$ and the variance
$\operatorname{Var}(\zeta) = U - C W^{-1} C^\top$.


Exercise 5.3.3. Extend the proof of Lemma 5.3.1 to the case when the vectors $\xi$
and $\eta$ are not zero mean.


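It remains to deduce the desired result about the posterior distribution from this
lemma. The formulas for the first two moments of $\vartheta$ and $Y$ follow directly
from (5.5) and (5.6). Now we apply Lemma 5.3.1 with $U = \Gamma$, $C = \Gamma\Psi$,
$W = \Psi^\top \Gamma \Psi + \Sigma$. It follows that the vector $\vartheta$ conditioned on $Y$ is normal with

$$\mathbb{E}(\vartheta \mid Y) = \tau + \Gamma\Psi W^{-1} (Y - \Psi^\top \tau), \tag{5.9}$$
$$\operatorname{Var}(\vartheta \mid Y) = \Gamma - \Gamma\Psi W^{-1} \Psi^\top \Gamma.$$

Now we apply the following identity: for any $p \times n$-matrix $A$

$$A (I_n + A^\top A)^{-1} = (I_p + A A^\top)^{-1} A. \tag{5.10}$$

The latter implies with $A = \Gamma^{1/2} \Psi \Sigma^{-1/2}$ that

$$\Gamma - \Gamma\Psi W^{-1} \Psi^\top \Gamma = \Gamma^{1/2} \{ I_p - A (I_n + A^\top A)^{-1} A^\top \} \Gamma^{1/2} = \Gamma^{1/2} (I_p + A A^\top)^{-1} \Gamma^{1/2} = (\Gamma^{-1} + \Psi \Sigma^{-1} \Psi^\top)^{-1} = B^{-1}.$$

Similarly (5.10) yields with the same $A$

$$\Gamma\Psi W^{-1} = \Gamma^{1/2} A (I_n + A^\top A)^{-1} \Sigma^{-1/2} = \Gamma^{1/2} (I_p + A A^\top)^{-1} A \Sigma^{-1/2} = B^{-1} \Psi \Sigma^{-1}.$$

This implies (5.7) by (5.9).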
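The equivalence of the two expressions for the posterior moments, the one from Lemma 5.3.1 and the one through $B = \Gamma^{-1} + \Psi \Sigma^{-1} \Psi^\top$, can be verified numerically in the smallest non-trivial case $p = 1$, $n = 2$. This is a sketch only: all numerical values are hypothetical, and the hand-written 2x2 inverse is used merely to keep the example free of external libraries.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(u, m, v):
    """Bilinear form u^T m v for 2-vectors u, v and a 2x2 matrix m."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

def posterior_via_lemma(y, psi, sigma, tau, gamma):
    """Mean (5.9) and variance Gamma - Gamma Psi W^{-1} Psi^T Gamma for scalar theta."""
    w = [[gamma * psi[i] * psi[j] + sigma[i][j] for j in range(2)] for i in range(2)]
    w_inv = inv2(w)
    resid = [y[i] - psi[i] * tau for i in range(2)]
    mean = tau + gamma * quad(psi, w_inv, resid)
    var = gamma - gamma ** 2 * quad(psi, w_inv, psi)
    return mean, var

def posterior_via_precision(y, psi, sigma, tau, gamma):
    """Mean (5.7) and variance (5.8) through B = Gamma^{-1} + Psi Sigma^{-1} Psi^T."""
    s_inv = inv2(sigma)
    b = 1.0 / gamma + quad(psi, s_inv, psi)
    mean = (tau / gamma + quad(psi, s_inv, y)) / b
    return mean, 1.0 / b

# hypothetical inputs: Psi is the 1x2 row (1, 0.5), correlated errors
y, psi = [1.0, 2.0], [1.0, 0.5]
sigma = [[1.0, 0.2], [0.2, 2.0]]
m1, v1 = posterior_via_lemma(y, psi, sigma, tau=0.3, gamma=1.5)
m2, v2 = posterior_via_precision(y, psi, sigma, tau=0.3, gamma=1.5)
```

The two routes agree up to rounding error, and the posterior variance is smaller than the prior variance $\Gamma = 1.5$.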
Exercise 5.3.4. Check the details of the proof of Theorem 5.3.3. 


Exercise 5.3.5. Derive the result of Theorem 5.3.3 by direct computation of the
density of $\vartheta$ given $Y$.

Hint: use that $\vartheta$ and $Y$ are jointly normal vectors. Consider their joint density
$p(\theta, Y)$ for $Y$ fixed and obtain the conditional density by analyzing its linear and
quadratic terms w.r.t. $\theta$.


Exercise 5.3.6. Show that $\operatorname{Var}(\vartheta \mid Y) \le \operatorname{Var}(\vartheta) = \Gamma$.
Hint: use that $\operatorname{Var}(\vartheta \mid Y) = B^{-1}$ and $B \stackrel{\text{def}}{=} \Gamma^{-1} + \Psi \Sigma^{-1} \Psi^\top \ge \Gamma^{-1}$.


The last exercise delivers an important message: the variance of the posterior is
smaller than the variance of the prior. This is intuitively clear because the posterior
utilizes both sources of information: the one contained in the prior and the one
we get from the data $Y$. However, even in the simple Gaussian case, the proof is
quite complicated. Another interpretation of this fact will be given later: the Bayes
approach effectively performs a kind of regularization and thus, leads to a reduction
of the variance; cf. Sect. 4.7. Another conclusion from the formulas (5.7), (5.8) is
that the moments of the posterior distribution approach the moments of the MLE
$\tilde\theta = (\Psi \Sigma^{-1} \Psi^\top)^{-1} \Psi \Sigma^{-1} Y$ as $\Gamma$ grows.


5.3.3 Homogeneous Errors, Orthogonal Design 


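Consider a linear model $Y_i = \Psi_i^\top \theta + \varepsilon_i$ for $i = 1, \ldots, n$, where $\Psi_i$ are given
vectors in $\mathbb{R}^p$ and $\varepsilon_i$ are i.i.d. normal $N(0, \sigma^2)$. This model is a special case of
the model (5.4) with $\Psi = (\Psi_1, \ldots, \Psi_n)$ and uncorrelated homogeneous errors $\varepsilon$
yielding $\Sigma = \sigma^2 I_n$. Then $\Sigma^{-1} = \sigma^{-2} I_n$, $B = \Gamma^{-1} + \sigma^{-2} \Psi\Psi^\top$,

$$\mathbb{E}(\vartheta \mid Y) = B^{-1} \Gamma^{-1} \tau + \sigma^{-2} B^{-1} \Psi Y, \tag{5.11}$$
$$\operatorname{Var}(\vartheta \mid Y) = B^{-1},$$

where $\Psi\Psi^\top = \sum_i \Psi_i \Psi_i^\top$. If the prior variance is also homogeneous, that is, $\Gamma = r^2 I_p$,
then the formulas can be further simplified. In particular,

$$\operatorname{Var}(\vartheta \mid Y) = (r^{-2} I_p + \sigma^{-2} \Psi\Psi^\top)^{-1}.$$

The most transparent case corresponds to the orthogonal design with $\Psi\Psi^\top = \eta^2 I_p$
for some $\eta^2 > 0$. Then

$$\mathbb{E}(\vartheta \mid Y) = \frac{\sigma^2/r^2}{\eta^2 + \sigma^2/r^2}\, \tau + \frac{1}{\eta^2 + \sigma^2/r^2}\, \Psi Y, \tag{5.12}$$
$$\operatorname{Var}(\vartheta \mid Y) = \frac{\sigma^2}{\eta^2 + \sigma^2/r^2}\, I_p. \tag{5.13}$$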

Exercise 5.3.7. Derive (5.12) and (5.13) from Theorem 5.3.3 with $\Sigma = \sigma^2 I_n$, $\Gamma = r^2 I_p$,
and $\Psi\Psi^\top = \eta^2 I_p$.


Exercise 5.3.8. Show that the posterior mean is the convex combination of the
MLE $\tilde\theta = \eta^{-2} \Psi Y$ and the prior mean $\tau$:

$$\mathbb{E}(\vartheta \mid Y) = \rho \tau + (1 - \rho) \tilde\theta,$$

with $\rho = (\sigma^2/r^2) / (\eta^2 + \sigma^2/r^2)$. Moreover, $\rho \to 0$ as $\eta \to \infty$, that is, the posterior
mean approaches the MLE $\tilde\theta$.


5.4 Non-informative Priors


The Bayes approach requires fixing a prior distribution on the values of the parameter
$\vartheta$. What happens if no such information is available? Is the Bayes approach still
applicable? An immediate answer is "no," however it is a bit hasty. Actually one
can still apply the Bayes approach with priors which do not give any preference
to one point against the others. Such priors are called non-informative. Consider
first the case when the set $\Theta$ is finite: $\Theta = \{\theta_1, \ldots, \theta_M\}$. Then the non-informative
prior is just the uniform measure on $\Theta$ giving to every point $\theta_m$ the equal probability
$1/M$. Then the joint probability of $Y$ and $\vartheta$ is the average of the measures $\mathbb{P}_{\theta_m}$ and
the same holds for the marginal distribution of the data:

$$p(y) = \frac{1}{M} \sum_{m=1}^{M} p(y \mid \theta_m).$$

The posterior distribution is already "informative" and it differs from the uniform
prior:

$$p(\theta_k \mid y) = \frac{p(y \mid \theta_k)\, \pi(\theta_k)}{p(y)} = \frac{p(y \mid \theta_k)}{\sum_{m=1}^{M} p(y \mid \theta_m)}, \qquad k = 1, \ldots, M.$$


Exercise 5.4.1. Check that the posterior measure is non-informative iff all the
measures $\mathbb{P}_{\theta_m}$ coincide.


A similar situation arises if the set $\Theta$ is a non-discrete bounded subset in $\mathbb{R}^p$. A
typical example is given by the case of a univariate parameter restricted to a finite
interval $[a, b]$. Define $\pi(\theta) = 1/\pi(\Theta)$, where

$$\pi(\Theta) \stackrel{\text{def}}{=} \int_\Theta d\theta.$$

Then

$$p(y) = \frac{1}{\pi(\Theta)} \int_\Theta p(y \mid \theta)\, d\theta,$$

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, \pi(\theta)}{p(y)} = \frac{p(y \mid \theta)}{\int_\Theta p(y \mid \theta)\, d\theta}. \tag{5.14}$$

In some cases the non-informative uniform prior can be used even for unbounded
parameter sets. Indeed, what we really need is that the integrals in the denominator
of the last formula are finite:

$$\int_\Theta p(y \mid \theta)\, d\theta < \infty \qquad \forall y.$$

Then we can apply (5.14) even if $\Theta$ is unbounded.


Exercise 5.4.2. Consider the Gaussian Shift model (5.2) and (5.3).

(i) Check that for $n = 1$, the value $\int_{-\infty}^{\infty} p(y \mid \theta)\, d\theta$ is finite for every $y$ and the
posterior distribution of $\vartheta$ coincides with the distribution of $Y$.
(ii) Compute the posterior for $n > 1$.


Exercise 5.4.3. Consider the Gaussian regression model $Y = \Psi^\top \vartheta + \varepsilon$,
$\varepsilon \sim N(0, \Sigma)$, and the non-informative prior $\pi$ which is the Lebesgue measure
on the space $\mathbb{R}^p$. Show that the posterior for $\vartheta$ is normal with mean
$\tilde\theta = (\Psi \Sigma^{-1} \Psi^\top)^{-1} \Psi \Sigma^{-1} Y$ and variance $(\Psi \Sigma^{-1} \Psi^\top)^{-1}$. Compare with the result of
Theorem 5.3.3.


Note that the result of this exercise can be formally derived from Theorem 5.3.3 by
replacing $\Gamma^{-1}$ with 0.

Another way of tackling the case of an unbounded parameter set is to consider a
sequence of priors that approaches the uniform distribution on the whole parameter
set. In the case of linear Gaussian models and normal priors, a natural way is to let
the prior variance tend to infinity. Consider first the univariate case; see Sect. 5.3.1.
A non-informative prior can be approximated by the normal distribution with mean
zero and variance $r^2$ tending to infinity. Then

$$\vartheta \mid Y \sim N\Bigl( \frac{Y r^2}{\sigma^2 + r^2},\ \frac{\sigma^2 r^2}{\sigma^2 + r^2} \Bigr) \longrightarrow N(Y, \sigma^2), \qquad r \to \infty.$$

It is interesting to note that the case of an i.i.d. sample in fact reduces the situation
to the case of a non-informative prior. Indeed, the result of Theorem 5.3.3 implies
with $r_n^2 = n r^2$

$$\vartheta \mid Y \sim N\Bigl( \frac{\bar Y r_n^2}{\sigma^2 + r_n^2},\ \frac{\sigma^2 r^2}{\sigma^2 + r_n^2} \Bigr).$$

One says that the prior information "washes out" from the posterior distribution as
the sample size $n$ tends to infinity.


5.5 Bayes Estimate and Posterior Mean 


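Given a loss function $\wp(\theta, \theta')$ on $\Theta \times \Theta$, the Bayes risk of an estimate $\hat\theta = \hat\theta(Y)$
is defined as

$$\mathcal{R}_\pi(\hat\theta) \stackrel{\text{def}}{=} \mathbb{E}_\pi \wp(\hat\theta, \vartheta) = \int_\Theta \Bigl( \int_{\mathcal{Y}} \wp(\hat\theta(y), \theta)\, p(y \mid \theta)\, \mu_0(dy) \Bigr) \pi(\theta)\, d\theta.$$

Note that $\vartheta$ in this formula is treated as a random variable that follows the prior
distribution $\pi$. One can represent this formula symbolically in the form

$$\mathcal{R}_\pi(\hat\theta) = \mathbb{E}\bigl[ \mathbb{E}\bigl( \wp(\hat\theta, \vartheta) \mid \theta \bigr) \bigr] = \mathbb{E}\, \mathcal{R}(\hat\theta, \vartheta).$$

Here the external integration averages the pointwise risk $\mathcal{R}(\hat\theta, \vartheta)$ over all possible
values of $\vartheta$ due to the prior distribution.

The Bayes formula $p(y \mid \theta)\, \pi(\theta) = p(\theta \mid y)\, p(y)$ and a change of the order of
integration can be used to represent the Bayes risk via the posterior density:

$$\mathcal{R}_\pi(\hat\theta) = \int_{\mathcal{Y}} \Bigl( \int_\Theta \wp(\hat\theta(y), \theta)\, p(\theta \mid y)\, d\theta \Bigr) p(y)\, \mu_0(dy) = \mathbb{E}\bigl[ \mathbb{E}\{ \wp(\hat\theta, \vartheta) \mid Y \} \bigr].$$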
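The estimate $\tilde\theta_\pi$ is called Bayes or $\pi$-Bayes if it minimizes the corresponding risk:

$$\tilde\theta_\pi = \operatorname*{argmin}_{\hat\theta} \mathcal{R}_\pi(\hat\theta),$$

where the infimum is taken over the class of all feasible estimates. The most
widespread choice of the loss function is the quadratic one:

$$\wp(\theta, \theta') = \|\theta - \theta'\|^2.$$

The great advantage of this choice is that the Bayes solution can be given explicitly;
it is the posterior mean

$$\tilde\theta_\pi \stackrel{\text{def}}{=} \mathbb{E}(\vartheta \mid Y) = \int_\Theta \theta\, p(\theta \mid Y)\, d\theta.$$

Note that due to Bayes' formula, this value can be rewritten

$$\tilde\theta_\pi = \frac{1}{p(Y)} \int_\Theta \theta\, p(Y \mid \theta)\, \pi(\theta)\, d\theta, \qquad p(Y) = \int_\Theta p(Y \mid \theta)\, \pi(\theta)\, d\theta.$$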
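Theorem 5.5.1. It holds for any estimate $\hat\theta$

$$\mathcal{R}_\pi(\hat\theta) \ge \mathcal{R}_\pi(\tilde\theta_\pi).$$

Proof. The main feature of the posterior mean is that it provides a kind of projection
of the data. This property can be formalized as follows:

$$\mathbb{E}(\tilde\theta_\pi - \vartheta \mid Y) = \int_\Theta (\tilde\theta_\pi - \theta)\, p(\theta \mid Y)\, d\theta = 0,$$

yielding for any estimate $\hat\theta = \hat\theta(Y)$

$$\mathbb{E}\bigl( \|\hat\theta - \vartheta\|^2 \mid Y \bigr) = \mathbb{E}\bigl( \|\tilde\theta_\pi - \vartheta\|^2 \mid Y \bigr) + \mathbb{E}\bigl( \|\tilde\theta_\pi - \hat\theta\|^2 \mid Y \bigr) + 2 (\hat\theta - \tilde\theta_\pi)^\top \mathbb{E}(\tilde\theta_\pi - \vartheta \mid Y)$$
$$= \mathbb{E}\bigl( \|\tilde\theta_\pi - \vartheta\|^2 \mid Y \bigr) + \mathbb{E}\bigl( \|\tilde\theta_\pi - \hat\theta\|^2 \mid Y \bigr)$$
$$\ge \mathbb{E}\bigl( \|\tilde\theta_\pi - \vartheta\|^2 \mid Y \bigr).$$

Here we have used that both $\hat\theta$ and $\tilde\theta_\pi$ are functions of $Y$ and can be considered as
constants when taking the conditional expectation given $Y$. Now

$$\mathcal{R}_\pi(\hat\theta) = \mathbb{E}\|\hat\theta - \vartheta\|^2 = \mathbb{E}\bigl[ \mathbb{E}\bigl( \|\hat\theta - \vartheta\|^2 \mid Y \bigr) \bigr] \ge \mathbb{E}\bigl[ \mathbb{E}\bigl( \|\tilde\theta_\pi - \vartheta\|^2 \mid Y \bigr) \bigr] = \mathcal{R}_\pi(\tilde\theta_\pi)$$

and the result follows.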
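The key step of the proof, that the posterior mean minimizes the posterior expected quadratic loss, is easy to see numerically for a discrete posterior. The sketch below uses a hypothetical three-point posterior and a grid search over candidate estimates; both are assumptions made for illustration only.

```python
def expected_sq_loss(t, posterior):
    """Posterior expected quadratic loss of the estimate t for a discrete posterior."""
    return sum(w * (t - th) ** 2 for th, w in posterior.items())

# hypothetical posterior distribution on three parameter values
posterior = {0.2: 0.1, 0.5: 0.6, 0.9: 0.3}
post_mean = sum(w * th for th, w in posterior.items())

# brute-force search for the best constant estimate on a grid of step 0.01
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda t: expected_sq_loss(t, posterior))
```

The grid search returns 0.59, which is the posterior mean $0.1 \cdot 0.2 + 0.6 \cdot 0.5 + 0.3 \cdot 0.9$; any other candidate has a loss larger by the squared distance to this value.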


Exercise 5.5.1. Consider the univariate case with the loss function $|\theta - \theta'|$. Check
that the posterior median minimizes the Bayes risk.


5.6 Posterior Mean and Ridge Regression 


Here we again consider the case of a linear Gaussian model

$$Y = \Psi^\top \theta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n).$$

(To simplify the presentation, we focus here on the case of homogeneous errors with
$\Sigma = \sigma^2 I_n$.) Recall that the maximum likelihood estimate $\tilde\theta$ for this model reads as

$$\tilde\theta = (\Psi\Psi^\top)^{-1} \Psi Y.$$

A penalized MLE $\tilde\theta_G$ for the roughness penalty $\|G\theta\|^2$ is given by

$$\tilde\theta_G = (\Psi\Psi^\top + \sigma^2 G^2)^{-1} \Psi Y;$$

see Sect. 4.7. It turns out that a similar estimate appears in quite a natural way
within the Bayes approach. Consider the normal prior distribution $\vartheta \sim N(0, G^{-2})$.
The posterior will be normal as well with the posterior mean:

$$\tilde\theta_\pi = \sigma^{-2} B^{-1} \Psi Y = (\Psi\Psi^\top + \sigma^2 G^2)^{-1} \Psi Y;$$

see (5.11). It appears that $\tilde\theta_\pi = \tilde\theta_G$ for the normal prior $\pi = N(0, G^{-2})$.

One can say that the Bayes approach with a Gaussian prior leads to a
regularization of the least squares method which is similar to quadratic penalization.
The degree of regularization is inversely proportional to the variance of the prior.
The larger the variance, the closer the prior is to the non-informative one and the
posterior mean $\tilde\theta_\pi$ to the MLE $\tilde\theta$.
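The identity between the ridge-type estimate and the posterior mean can be checked numerically for a scalar parameter. In the sketch below, which is an illustration under stated assumptions, the design vector `psi`, the data `y`, the noise variance and the scalar penalty `g` are hypothetical; the posterior mean is obtained by brute-force midpoint integration of likelihood times the $N(0, g^{-2})$ prior rather than from formula (5.11).

```python
import math

def ridge(y, psi, sigma2, g):
    """Penalized estimate (Psi Psi^T + sigma^2 G^2)^{-1} Psi Y for scalar theta."""
    psi_y = sum(p * v for p, v in zip(psi, y))
    psi_psi = sum(p * p for p in psi)
    return psi_y / (psi_psi + sigma2 * g * g)

def posterior_mean_grid(y, psi, sigma2, g, lo=-10.0, hi=10.0, m=20000):
    """Posterior mean under the prior N(0, g^{-2}) by midpoint integration."""
    h = (hi - lo) / m
    num = den = 0.0
    for i in range(m):
        th = lo + (i + 0.5) * h
        log_lik = -sum((v - p * th) ** 2 for p, v in zip(psi, y)) / (2 * sigma2)
        log_prior = -(g * th) ** 2 / 2
        w = math.exp(log_lik + log_prior)
        num += th * w
        den += w
    return num / den

psi = [1.0, 2.0, 3.0]          # hypothetical design (p = 1, n = 3)
y = [1.0, 2.1, 2.9]            # hypothetical observations
theta_ridge = ridge(y, psi, sigma2=1.0, g=2.0)
theta_bayes = posterior_mean_grid(y, psi, sigma2=1.0, g=2.0)
theta_mle = ridge(y, psi, sigma2=1.0, g=0.0)
```

Both estimates equal $13.9/18 \approx 0.772$, whereas the unpenalized MLE is $13.9/14 \approx 0.993$: the prior with variance $g^{-2} = 1/4$ shrinks the estimate toward zero.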


5.7 Bayes and Minimax Risks 


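Consider the parametric model $Y \sim \mathbb{P} \in (\mathbb{P}_\theta, \theta \in \Theta \subset \mathbb{R}^p)$. Let $\tilde\theta$ be an estimate
of the parameter $\vartheta$ from the available data $Y$. Formally $\tilde\theta$ is a measurable function
of $Y$ with values in $\Theta$:

$$\tilde\theta = \tilde\theta(Y) : \mathcal{Y} \to \Theta.$$

The quality of estimation is assigned by the loss function $\wp(\theta, \theta')$. In estimation
problems one usually selects this function in the form $\wp(\theta, \theta') = \wp_1(\theta - \theta')$ for
another function $\wp_1$ of one argument. Typical examples are given by quadratic loss
$\wp_1(\theta) = \|\theta\|^2$ or absolute loss $\wp_1(\theta) = \|\theta\|$. Given such a loss function, the
pointwise risk of $\tilde\theta$ at $\theta$ is defined as

$$\mathcal{R}(\tilde\theta, \theta) = \mathbb{E}_\theta\, \wp(\tilde\theta, \theta).$$

The minimax risk is defined as the maximum of pointwise risks over all $\theta \in \Theta$:

$$\mathcal{R}(\tilde\theta) = \sup_{\theta \in \Theta} \mathcal{R}(\tilde\theta, \theta) = \sup_{\theta \in \Theta} \mathbb{E}_\theta\, \wp(\tilde\theta, \theta).$$

Similarly, the Bayes risk for a prior $\pi$ is defined by weighting the pointwise risks
according to the prior distribution:

$$\mathcal{R}_\pi(\tilde\theta) = \mathbb{E}_\pi \mathcal{R}(\tilde\theta, \vartheta) = \int_\Theta \mathbb{E}_\theta\, \wp(\tilde\theta, \theta)\, \pi(\theta)\, d\theta.$$

It is obvious that the Bayes risk is always smaller than or equal to the minimax one,
whatever the prior measure is:

$$\mathcal{R}_\pi(\tilde\theta) \le \mathcal{R}(\tilde\theta).$$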
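The relation between the Bayes risk and the maximal pointwise risk can be illustrated on a Bernoulli sample of size $n$ with the empirical frequency $S/n$ as the estimate, whose quadratic risk at $\theta$ is $\theta(1-\theta)/n$. The prior below is a hypothetical uniform distribution on nine points; it is an assumption of this sketch only.

```python
def risk(theta, n):
    """Quadratic risk of the empirical frequency S/n in the Bernoulli model."""
    return theta * (1 - theta) / n

n = 10
support = [k / 10 for k in range(1, 10)]             # hypothetical finite prior support
prior = {th: 1 / len(support) for th in support}      # uniform weights on the support

# Bayes risk: pointwise risks averaged with the prior weights
bayes_risk = sum(w * risk(th, n) for th, w in prior.items())
# the maximal risk over [0, 1] is attained at theta = 1/2, which lies in the support
max_risk = max(risk(th, n) for th in support)
```

Here the maximal risk is $1/(4n) = 0.025$ while the Bayes risk is about $0.0183$. A prior concentrated at $\theta = 1/2$ would make the two values coincide.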


The famous Le Cam theorem states that the minimax risk can be recovered by taking
the maximum over all priors:

$$\mathcal{R}(\tilde\theta) = \sup_\pi \mathcal{R}_\pi(\tilde\theta).$$

Moreover, the maximum can be taken over all discrete priors with finite supports.


5.8 Van Trees Inequality


The Cramér-Rao inequality yields a lower bound on the risk for any unbiased
estimator. However, it appears to be sub-optimal if the condition of no-bias is
dropped. Another way to get a general bound on the quadratic risk of any estimator
is to bound from below a Bayes risk for any suitable prior and then to maximize this
lower bound in a class of all such priors.

Let $Y$ be an observed vector in $\mathbb{R}^n$ and $(\mathbb{P}_\theta)$ be the corresponding parametric
family with the density function $p(y \mid \theta)$ w.r.t. a measure $\mu_0$ on $\mathbb{R}^n$. Let also
$\hat\theta = \hat\theta(Y)$ be an arbitrary estimator of $\theta$. For any prior $\pi$, we aim to lower bound the
$\pi$-Bayes risk $\mathcal{R}_\pi(\hat\theta)$ of $\hat\theta$. We already know that this risk is minimized by the posterior
mean estimator $\tilde\theta_\pi$. However, the presented bound does not rely on a particular
structure of the considered estimator. Similarly to the Cramér-Rao inequality, it is
entirely based on some geometric properties of the log-likelihood function $\log p(y \mid \theta)$.

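To simplify the explanation, consider first the case of a univariate parameter
$\theta \in \Theta \subseteq \mathbb{R}$. Below we assume that the prior $\pi$ has a positive continuously
differentiable density $\pi(\theta)$. In addition we suppose that the parameter set $\Theta$ is an
interval, possibly infinite, and the prior density $\pi(\theta)$ vanishes at the edges of $\Theta$. By
$F_\pi$ we denote the Fisher information for the prior distribution $\pi$:

$$F_\pi \stackrel{\text{def}}{=} \int_\Theta \frac{|\pi'(\theta)|^2}{\pi(\theta)}\, d\theta.$$

Recall also the definition of the full Fisher information for the family $(\mathbb{P}_\theta)$:

$$F(\theta) \stackrel{\text{def}}{=} \mathbb{E}_\theta \Bigl| \frac{p'(Y \mid \theta)}{p(Y \mid \theta)} \Bigr|^2 = \int \frac{|p'(y \mid \theta)|^2}{p(y \mid \theta)}\, d\mu_0(y).$$

These quantities will enter in the risk bound. In what follows we also use the
notation

$$p_\pi(y, \theta) = p(y \mid \theta)\, \pi(\theta),$$
$$p'_\pi(y, \theta) \stackrel{\text{def}}{=} \frac{d p_\pi(y, \theta)}{d\theta}.$$

The Bayesian analog of the score function is

$$L'_\pi(Y, \theta) \stackrel{\text{def}}{=} \frac{p'_\pi(Y, \theta)}{p_\pi(Y, \theta)} = \frac{p'(Y \mid \theta)}{p(Y \mid \theta)} + \frac{\pi'(\theta)}{\pi(\theta)}. \tag{5.15}$$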


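The use of $p_\pi(y, \theta) \equiv 0$ at the edges of $\Theta$ implies for any $\hat\theta = \hat\theta(y)$ and any
$y \in \mathbb{R}^n$

$$\int_\Theta \bigl[ \{\hat\theta(y) - \theta\}\, p_\pi(y, \theta) \bigr]'\, d\theta = \{\hat\theta(y) - \theta\}\, p_\pi(y, \theta) \Big|_{\partial\Theta} = 0,$$

and hence

$$\int_\Theta \{\hat\theta(y) - \theta\}\, p'_\pi(y, \theta)\, d\theta = \int_\Theta p_\pi(y, \theta)\, d\theta. \tag{5.16}$$

This is an interesting identity. It holds for each $y$ with $\mu_0$-probability one and the
estimate $\hat\theta$ only appears in its left-hand side. This can be explained by the fact
that $\int_\Theta p'_\pi(y, \theta)\, d\theta = 0$ which follows by the same calculus. Based on (5.16),
one can compute the expectation of the product of $\hat\theta(Y) - \vartheta$ and
$L'_\pi(Y, \vartheta) = p'_\pi(Y, \vartheta)/p_\pi(Y, \vartheta)$:

$$\mathbb{E}_\pi\bigl[ \{\hat\theta(Y) - \vartheta\}\, L'_\pi(Y, \vartheta) \bigr] = \int_{\mathbb{R}^n} \int_\Theta \{\hat\theta(y) - \theta\}\, p'_\pi(y, \theta)\, d\theta\, d\mu_0(y) = \int_{\mathbb{R}^n} \int_\Theta p_\pi(y, \theta)\, d\theta\, d\mu_0(y) = 1. \tag{5.17}$$

Again, the remarkable feature of this equality is that the estimate $\hat\theta$ only appears in
the left-hand side. Now the idea of the obtained bound is very simple. We introduce
a r.v. $h(Y, \vartheta) = L'_\pi(Y, \vartheta) / \mathbb{E}_\pi |L'_\pi(Y, \vartheta)|^2$ and use orthogonality of
$\hat\theta(Y) - \vartheta - h(Y, \vartheta)$ and $h(Y, \vartheta)$ and the Pythagoras Theorem to show that the squared risk of
$\hat\theta$ is not smaller than $\mathbb{E}_\pi[h^2(Y, \vartheta)]$. More precisely, denote

$$\mathcal{I}_\pi \stackrel{\text{def}}{=} \mathbb{E}_\pi\bigl[ \{L'_\pi(Y, \vartheta)\}^2 \bigr],$$
$$h(Y, \vartheta) \stackrel{\text{def}}{=} \mathcal{I}_\pi^{-1} L'_\pi(Y, \vartheta).$$

Then $\mathbb{E}_\pi[h^2(Y, \vartheta)] = \mathcal{I}_\pi^{-1}$ and (5.17) implies

$$\mathbb{E}_\pi\bigl[ \{\hat\theta(Y) - \vartheta - h(Y, \vartheta)\}\, h(Y, \vartheta) \bigr] = \mathcal{I}_\pi^{-1}\, \mathbb{E}_\pi\bigl[ \{\hat\theta(Y) - \vartheta\}\, L'_\pi(Y, \vartheta) \bigr] - \mathbb{E}_\pi[h^2(Y, \vartheta)] = 0$$

and

$$\mathbb{E}_\pi\bigl[ (\hat\theta(Y) - \vartheta)^2 \bigr] = \mathbb{E}_\pi\bigl[ \{\hat\theta(Y) - \vartheta - h(Y, \vartheta) + h(Y, \vartheta)\}^2 \bigr]$$
$$= \mathbb{E}_\pi\bigl[ \{\hat\theta(Y) - \vartheta - h(Y, \vartheta)\}^2 \bigr] + \mathbb{E}_\pi[h^2(Y, \vartheta)] + 2\, \mathbb{E}_\pi\bigl[ \{\hat\theta(Y) - \vartheta - h(Y, \vartheta)\}\, h(Y, \vartheta) \bigr]$$
$$= \mathbb{E}_\pi\bigl[ \{\hat\theta(Y) - \vartheta - h(Y, \vartheta)\}^2 \bigr] + \mathbb{E}_\pi[h^2(Y, \vartheta)]$$
$$\ge \mathbb{E}_\pi[h^2(Y, \vartheta)] = \mathcal{I}_\pi^{-1}.$$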


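It remains to compute $\mathcal{I}_{\pi}$. We use the representation (5.15) for $L'_{\pi}(Y, \theta)$. Further
we use the identity $\int p(y \mid \theta) \, d\mu_{0}(y) = 1$ which implies by differentiating in $\theta$

$$ \int p'(y \mid \theta) \, d\mu_{0}(y) = 0 . $$

This yields that two random variables $p'(Y \mid \vartheta)/p(Y \mid \vartheta)$ and $\pi'(\vartheta)/\pi(\vartheta)$ are
orthogonal under the measure $\mathbb{P}_{\pi}$:

$$ \begin{aligned}
\mathbb{E}_{\pi}\left[ \frac{p'(Y \mid \vartheta)}{p(Y \mid \vartheta)} \, \frac{\pi'(\vartheta)}{\pi(\vartheta)} \right]
 &= \int_{\Theta} \int_{\mathbb{R}^{n}} p'(y \mid \theta)\, \pi'(\theta) \, d\mu_{0}(y) \, d\theta \\
 &= \int_{\Theta} \left\{ \int_{\mathbb{R}^{n}} p'(y \mid \theta) \, d\mu_{0}(y) \right\} \pi'(\theta) \, d\theta = 0 .
\end{aligned} $$

Now by the Pythagoras Theorem

$$ \begin{aligned}
\mathbb{E}_{\pi}\bigl\{ L'_{\pi}(Y, \vartheta) \bigr\}^{2}
 &= \mathbb{E}_{\pi}\left\{ \frac{p'(Y \mid \vartheta)}{p(Y \mid \vartheta)} \right\}^{2}
    + \mathbb{E}_{\pi}\left\{ \frac{\pi'(\vartheta)}{\pi(\vartheta)} \right\}^{2} \\
 &= \int_{\Theta} \mathbb{E}_{\theta}\left\{ \frac{p'(Y \mid \theta)}{p(Y \mid \theta)} \right\}^{2} \pi(\theta) \, d\theta
    + \int_{\Theta} \left\{ \frac{\pi'(\theta)}{\pi(\theta)} \right\}^{2} \pi(\theta) \, d\theta \\
 &= \int_{\Theta} F(\theta)\, \pi(\theta) \, d\theta + F_{\pi} .
\end{aligned} $$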


Now we can summarize the derivations in the form of van Trees’ inequality. 


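Theorem 5.8.1 (van Trees). Let $\Theta$ be an interval on $\mathbb{R}$ and let the prior density
$\pi(\theta)$ have piecewise continuous first derivative, $\pi(\theta)$ be positive in the interior of
$\Theta$ and vanish at the edges. Then for any estimator $\hat{\theta}$ of $\theta$

$$ \mathbb{E}_{\pi} (\hat{\theta} - \vartheta)^{2} \ge \left\{ \int_{\Theta} F(\theta)\, \pi(\theta) \, d\theta + F_{\pi} \right\}^{-1} . $$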
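
As an illustration that is not part of the derivation above, the bound can be evaluated in closed form in a Gaussian case. The sketch below assumes $Y_1, \ldots, Y_n$ i.i.d. $N(\theta, \sigma^2)$ and the prior $N(0, \tau^2)$ on $\Theta = \mathbb{R}$, so that $F(\theta) = n/\sigma^2$ and $F_{\pi} = 1/\tau^2$. The function names and the numerical values are chosen here for illustration only. In this particular model the posterior mean should attain the bound, and the simulated Bayes risk is expected to be close to it.

```python
import random

def van_trees_bound(n, sigma2, tau2):
    """Van Trees lower bound 1 / (int F(theta) pi(theta) dtheta + F_pi).

    Assumed model: Y_i ~ N(theta, sigma2) i.i.d., prior theta ~ N(0, tau2),
    so F(theta) = n / sigma2 for every theta and F_pi = 1 / tau2.
    """
    return 1.0 / (n / sigma2 + 1.0 / tau2)

def posterior_mean(sample, sigma2, tau2):
    """Posterior mean of theta under the conjugate normal prior N(0, tau2)."""
    n = len(sample)
    return (sum(sample) / sigma2) / (n / sigma2 + 1.0 / tau2)

def bayes_risk_mc(n, sigma2, tau2, n_rep=20000, seed=0):
    """Monte Carlo estimate of E_pi (theta_hat - theta)^2 for the posterior mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        theta = rng.gauss(0.0, tau2 ** 0.5)  # parameter drawn from the prior
        sample = [rng.gauss(theta, sigma2 ** 0.5) for _ in range(n)]
        total += (posterior_mean(sample, sigma2, tau2) - theta) ** 2
    return total / n_rep

bound = van_trees_bound(n=10, sigma2=4.0, tau2=1.0)
risk = bayes_risk_mc(n=10, sigma2=4.0, tau2=1.0)
print(f"van Trees bound: {bound:.4f}, simulated Bayes risk: {risk:.4f}")
```

For other priors or models the integral of $F(\theta)\pi(\theta)$ generally has to be computed numerically, and the bound need not be attained.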


Now we consider a multivariate extension. We will say that a real function $g(\theta)$,
$\theta \in \Theta$, is nice if it is piecewise continuously differentiable in $\theta$ for almost all values
of $\theta$. Everywhere $g'(\cdot)$ means the derivative w.r.t. $\theta$, that is, $g'(\theta) = \frac{d}{d\theta} g(\theta)$.

We consider the following assumptions: 


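1. $p(y \mid \theta)$ is nice in $\theta$ for almost all $y$;
2. The full Fisher information matrix

$$ F(\theta) \stackrel{\mathrm{def}}{=} \operatorname{Var}_{\theta}\left[ \frac{p'(Y \mid \theta)}{p(Y \mid \theta)} \right]
   = \mathbb{E}_{\theta}\left[ \frac{p'(Y \mid \theta)}{p(Y \mid \theta)} \left\{ \frac{p'(Y \mid \theta)}{p(Y \mid \theta)} \right\}^{\top} \right] $$

   exists and is continuous in $\theta$;

3. $\Theta$ is compact with boundary which is piecewise $C^{1}$-smooth;

4. $\pi(\theta)$ is nice; $\pi(\theta)$ is positive on the interior of $\Theta$ and zero on its boundary. The
   Fisher information matrix $F_{\pi}$ of the prior $\pi$ is positive and finite:

$$ F_{\pi} \stackrel{\mathrm{def}}{=} \mathbb{E}_{\pi}\left[ \frac{\pi'(\vartheta)}{\pi(\vartheta)} \left\{ \frac{\pi'(\vartheta)}{\pi(\vartheta)} \right\}^{\top} \right] < \infty . $$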


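Theorem 5.8.2. Let the assumptions 1-4 hold. For any estimate $\hat{\theta} = \hat{\theta}(Y)$, it holds

$$ \mathbb{E}_{\pi}\bigl[ \{\hat{\theta}(Y) - \vartheta\} \{\hat{\theta}(Y) - \vartheta\}^{\top} \bigr] \ge \mathcal{I}_{\pi}^{-1} , \tag{5.18} $$

where

$$ \mathcal{I}_{\pi} \stackrel{\mathrm{def}}{=} \int_{\Theta} F(\theta)\, \pi(\theta) \, d\theta + F_{\pi} . $$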


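Proof. The use of $p_{\pi}(y, \theta) = p(y \mid \theta)\, \pi(\theta) = 0$ at the boundary of $\Theta$ implies for
any $\hat{\theta} = \hat{\theta}(y)$ and any $y \in \mathbb{R}^{n}$ by Stokes' theorem

$$ \int_{\Theta} \bigl[ \{\hat{\theta}(y) - \theta\}\, p_{\pi}(y, \theta) \bigr]' \, d\theta = 0 , $$

and hence

$$ \int_{\Theta} p'_{\pi}(y, \theta) \{\hat{\theta}(y) - \theta\}^{\top} \, d\theta = I_{p} \int_{\Theta} p_{\pi}(y, \theta) \, d\theta , $$

where $I_{p}$ is the identity matrix. Therefore, the random vector
$L'_{\pi}(Y, \vartheta) \stackrel{\mathrm{def}}{=} p'_{\pi}(Y, \vartheta) / p_{\pi}(Y, \vartheta)$ fulfills

$$ \begin{aligned}
\mathbb{E}_{\pi}\bigl[ L'_{\pi}(Y, \vartheta) \{\hat{\theta}(Y) - \vartheta\}^{\top} \bigr]
 &= \int_{\mathbb{R}^{n}} \int_{\Theta} p'_{\pi}(y, \theta) \{\hat{\theta}(y) - \theta\}^{\top} \, d\theta \, d\mu_{0}(y) \\
 &= I_{p} \int_{\mathbb{R}^{n}} \int_{\Theta} p_{\pi}(y, \theta) \, d\theta \, d\mu_{0}(y) = I_{p} .
\end{aligned} \tag{5.19} $$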

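Denote

$$ \mathcal{I}_{\pi} \stackrel{\mathrm{def}}{=} \mathbb{E}_{\pi}\bigl[ L'_{\pi}(Y, \vartheta) \{L'_{\pi}(Y, \vartheta)\}^{\top} \bigr], \qquad
   h(Y, \vartheta) \stackrel{\mathrm{def}}{=} \mathcal{I}_{\pi}^{-1} L'_{\pi}(Y, \vartheta) . $$

Then $\mathbb{E}_{\pi}\bigl[ h(Y, \vartheta) \{h(Y, \vartheta)\}^{\top} \bigr] = \mathcal{I}_{\pi}^{-1}$ and (5.19) implies

$$ \mathbb{E}_{\pi}\bigl[ h(Y, \vartheta) \{\hat{\theta}(Y) - \vartheta - h(Y, \vartheta)\}^{\top} \bigr]
   = \mathcal{I}_{\pi}^{-1} \mathbb{E}_{\pi}\bigl[ L'_{\pi}(Y, \vartheta) \{\hat{\theta}(Y) - \vartheta\}^{\top} \bigr]
     - \mathbb{E}_{\pi}\bigl[ h(Y, \vartheta) \{h(Y, \vartheta)\}^{\top} \bigr] = 0 $$

and hence

$$ \begin{aligned}
\mathbb{E}_{\pi}\bigl[ \{\hat{\theta}(Y) - \vartheta\} \{\hat{\theta}(Y) - \vartheta\}^{\top} \bigr]
 &= \mathbb{E}_{\pi}\bigl[ \{\hat{\theta}(Y) - \vartheta - h(Y, \vartheta) + h(Y, \vartheta)\}
      \{\hat{\theta}(Y) - \vartheta - h(Y, \vartheta) + h(Y, \vartheta)\}^{\top} \bigr] \\
 &= \mathbb{E}_{\pi}\bigl[ \{\hat{\theta}(Y) - \vartheta - h(Y, \vartheta)\} \{\hat{\theta}(Y) - \vartheta - h(Y, \vartheta)\}^{\top} \bigr]
    + \mathbb{E}_{\pi}\bigl[ h(Y, \vartheta) \{h(Y, \vartheta)\}^{\top} \bigr] \\
 &\ge \mathbb{E}_{\pi}\bigl[ h(Y, \vartheta) \{h(Y, \vartheta)\}^{\top} \bigr] = \mathcal{I}_{\pi}^{-1} .
\end{aligned} $$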


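It remains to compute $\mathcal{I}_{\pi}$. The definition implies

$$ L'_{\pi}(Y, \theta) \stackrel{\mathrm{def}}{=} \frac{1}{p_{\pi}(Y, \theta)}\, p'_{\pi}(Y, \theta)
   = \frac{1}{p(Y \mid \theta)}\, p'(Y \mid \theta) + \frac{1}{\pi(\theta)}\, \pi'(\theta) . $$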


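Further, the identity $\int p(y \mid \theta) \, d\mu_{0}(y) = 1$ yields by differentiating in $\theta$

$$ \int p'(y \mid \theta) \, d\mu_{0}(y) = 0 . $$

Using this identity, we show that the random vectors $p'(Y \mid \vartheta)/p(Y \mid \vartheta)$ and
$\pi'(\vartheta)/\pi(\vartheta)$ are orthogonal w.r.t. the measure $\mathbb{P}_{\pi}$. Indeed,

$$ \begin{aligned}
\mathbb{E}_{\pi}\left[ \frac{p'(Y \mid \vartheta)}{p(Y \mid \vartheta)} \left\{ \frac{\pi'(\vartheta)}{\pi(\vartheta)} \right\}^{\top} \right]
 &= \int_{\Theta} \int_{\mathbb{R}^{n}} p'(y \mid \theta) \{\pi'(\theta)\}^{\top} \, d\mu_{0}(y) \, d\theta \\
 &= \int_{\Theta} \left\{ \int_{\mathbb{R}^{n}} p'(y \mid \theta) \, d\mu_{0}(y) \right\} \{\pi'(\theta)\}^{\top} \, d\theta = 0 .
\end{aligned} $$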


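Now by the usual Pythagoras calculus, we obtain

$$ \begin{aligned}
\mathbb{E}_{\pi}\bigl[ L'_{\pi}(Y, \vartheta) \{L'_{\pi}(Y, \vartheta)\}^{\top} \bigr]
 &= \mathbb{E}_{\pi}\left[ \frac{p'(Y \mid \vartheta)}{p(Y \mid \vartheta)} \left\{ \frac{p'(Y \mid \vartheta)}{p(Y \mid \vartheta)} \right\}^{\top} \right]
    + \mathbb{E}_{\pi}\left[ \frac{\pi'(\vartheta)}{\pi(\vartheta)} \left\{ \frac{\pi'(\vartheta)}{\pi(\vartheta)} \right\}^{\top} \right] \\
 &= \int_{\Theta} F(\theta)\, \pi(\theta) \, d\theta + F_{\pi} ,
\end{aligned} $$

and the result follows.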


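This matrix inequality can be used for obtaining a number of $L_{2}$ bounds. We
present only two bounds for the squared norm $\|\hat{\theta}(Y) - \vartheta\|^{2}$.


Corollary 5.8.1. Under the same conditions 1-4, it holds

$$ \mathbb{E}_{\pi} \|\hat{\theta}(Y) - \vartheta\|^{2} \ge \operatorname{tr}\bigl( \mathcal{I}_{\pi}^{-1} \bigr) \ge p^{2} / \operatorname{tr}( \mathcal{I}_{\pi} ) . $$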


Proof. The first inequality follows directly from the bound (5.18) of Theorem 5.8.2. 
For the second one, it suffices to note that for any positive symmetric $p \times p$ matrix
$B$, it holds

$$ \operatorname{tr}(B)\, \operatorname{tr}(B^{-1}) \ge p^{2} . \tag{5.20} $$

This fact can be proved by the Cauchy-Schwarz inequality.


Exercise 5.8.1. Prove (5.20).
Hint: use the Cauchy-Schwarz inequality for the scalar product $\operatorname{tr}(B^{1/2} B^{-1/2})$ of
two matrices $B^{1/2}$ and $B^{-1/2}$ (considered as vectors in $\mathbb{R}^{p^{2}}$).


5.9 Historical Remarks and Further Reading 


The origin of the Bayesian approach to statistics was the article by Bayes (1763). 
Further theoretical foundations are due to de Finetti (1937), Savage (1954), and 
Jeffreys (1957). 

The theory of conjugated priors was developed by Raiffa and Schlaifer (1961). 
Conjugated priors for exponential families have been characterized by Diaconis and 
Ylvisaker (1979). Non-informative priors were considered by Jeffreys (1961). 

Bayes optimality of the posterior mean estimator under quadratic loss is a 
classical result which can be found, for instance, in Sect. 4.4.2 of Berger (1985). 




The van Trees inequality is originally due to Van Trees (1968), p. 72. Gill and
Levit (1995) applied it to the problem of establishing a Bayesian version of the
Cramér-Rao bound.

For further reading, we recommend the books by Berger (1985), Bernardo and 
Smith (1994), and Robert (2001). 


Chapter 6 
Testing a Statistical Hypothesis 


Let Y be the observed sample. The hypothesis testing problem assumes that there 
is some external information (hypothesis) about the distribution of this sample and 
the target is to check this hypothesis on the basis of the available data. 


6.1 Testing Problem 


This section specifies the main notions of the theory of hypothesis testing. We start 
with a simple hypothesis. Afterwards a composite hypothesis will be discussed. We 
also introduce the notions of the testing error, level, power, etc. 


6.1.1 Simple Hypothesis 


The classical testing problem consists in checking a specific hypothesis that the 
available data indeed follow an externally precisely given distribution. We illustrate 
this notion by several examples. 


Example 6.1.1 (Simple Game). Let $Y = (Y_{1}, \ldots, Y_{n})^{\top}$ be a Bernoulli sequence of
zeros and ones. This sequence can be viewed as the sequence of successes, or results
of throwing a coin, etc. The hypothesis about this sequence is that wins (associated
with one) and losses (associated with zero) are equally frequent in the long run. This
hypothesis can be formalized as follows: $\mathbb{P} = (P_{\theta^{*}})^{\otimes n}$ with $\theta^{*} = 1/2$, where $P_{\theta}$
describes the Bernoulli experiment with parameter $\theta$.


Example 6.1.2 (No Treatment Effect). Let $(\Psi_{i}, Y_{i})$ be experimental results, $i =
1, \ldots, n$. The linear regression model assumes a certain dependence of the form
$Y_{i} = \Psi_{i}^{\top} \theta + \varepsilon_{i}$ with errors $\varepsilon_{i}$ having zero mean. The "no effect" hypothesis means
that there is no systematic dependence of $Y_{i}$ on the factors $\Psi_{i}$, i.e., $\theta = \theta^{*} = 0$,
and the observations $Y_{i}$ are just noise.


Example 6.1.3 (Quality Control). Assume that the $Y_{i}$ are the results of a production
process which can be represented in the form $Y_{i} = \theta^{*} + \varepsilon_{i}$, where $\theta^{*}$ is a nominal
value and $\varepsilon_{i}$ is a measurement error. The hypothesis is that the observed process
indeed follows this model.


The general problem of testing a simple hypothesis is stated as follows: to check 
on the basis of the available observations Y that their distribution is described by a 
given measure P. The hypothesis is often called a null hypothesis or just null. 


6.1.2 Composite Hypothesis 


More generally, one can treat the problem of testing a composite hypothesis. Let
$(\mathbb{P}_{\theta}, \theta \in \Theta \subseteq \mathbb{R}^{p})$ be a given parametric family, and let $\Theta_{0} \subseteq \Theta$ be a nonempty
subset in $\Theta$. The hypothesis is that the data distribution $\mathbb{P}$ belongs to the set
$(\mathbb{P}_{\theta}, \theta \in \Theta_{0})$. Often, this hypothesis and the subset $\Theta_{0}$ are identified with each other
and one says that the hypothesis is given by $\Theta_{0}$.

We give some typical examples where such a formulation is natural. 


Example 6.1.4 (Testing a Subvector). Assume that the vector $\theta \in \Theta$ can be
decomposed into two parts: $\theta = (\gamma, \eta)$. The subvector $\gamma$ is the target of analysis
while the subvector $\eta$ matters for the distribution of the data but is not the target of
analysis. It is often called the nuisance parameter. The hypothesis we want to test is
$\gamma = \gamma^{*}$ for some fixed value $\gamma^{*}$. A typical situation in factor analysis where such
problems arise is to check on “no effect” for one particular factor in the presence of 
many different, potentially interrelated factors. 


Example 6.1.5 (Interval Testing). Let $\Theta$ be the real line and $\Theta_{0}$ be an interval. The
hypothesis is that $\mathbb{P} = \mathbb{P}_{\theta^{*}}$ for $\theta^{*} \in \Theta_{0}$. Such problems are typical for quality
control or warning (monitoring) systems when the controlled parameter should be 
in the prescribed range. 


Example 6.1.6 (Testing a Hypothesis About the Error Distribution). Consider the
regression model $Y_{i} = \Psi_{i}^{\top} \theta + \varepsilon_{i}$. The typical assumption about the errors $\varepsilon_{i}$ is
that they are zero-mean normal. One may be interested in testing this assumption,
having in mind that the cases of discrete, or heavy-tailed, or heteroscedastic errors 
can also occur. 


6.1.3 Statistical Tests 


A test is a statistical decision on the basis of the available data whether the
prespecified hypothesis is rejected or retained. So the decision space consists of
only two points, which we denote by zero and one. A decision $\phi$ is a mapping of the
data $Y$ to this space and is called a test:

$$ \phi : \mathcal{Y} \to \{0, 1\} . $$


The event $\{\phi = 1\}$ means that the null hypothesis is rejected and the opposite event
means non-rejection (acceptance) of the null. Usually the testing results are qualified
in the following way: rejection of the null hypothesis means that the data are not
consistent with the null, or, equivalently, the data contain some evidence against the
null hypothesis. Acceptance simply means that the data do not contradict the null.
Therefore, the term "non-rejection" is often considered more appropriate.

The region of acceptance is a subset of the observation space $\mathcal{Y}$ on which $\phi = 0$.
One also says that this region is the set of values for which we fail to reject the null
hypothesis. The region of rejection or critical region is, on the other hand, the subset
of $\mathcal{Y}$ on which $\phi = 1$.


6.1.4 Errors of the First Kind, Test Level 


In the hypothesis testing framework one distinguishes between errors of the first 
and second kind. An error of the first kind means that the null hypothesis is falsely 
rejected when it is correct. We formalize this notion first for the case of a simple 
hypothesis and then extend it to the general case. 

Let $H_{0} : Y \sim \mathbb{P}_{\theta^{*}}$ be a null hypothesis. The error of the first kind is the situation
in which the data indeed follow the null, but the decision of the test is to reject this
hypothesis: $\phi = 1$. Clearly the probability of such an error is $\mathbb{P}_{\theta^{*}}(\phi = 1)$. The
latter number in $[0, 1]$ is called the size of the test $\phi$. One says that $\phi$ is a test of level
$\alpha$ for some $\alpha \in (0, 1)$ if

$$ \mathbb{P}_{\theta^{*}}(\phi = 1) \le \alpha . $$


The value $\alpha$ is called level of the test or significance level. Often, size and level of
a test coincide; however, especially in discrete models, it is not always possible to
attain the significance level exactly by the chosen test $\phi$, meaning that the actual
size of $\phi$ is smaller than $\alpha$, see Example 6.1.7 below.

If the hypothesis is composite, then the level of the test is the maximum rejection
probability over the null subset $\Theta_{0}$. Here, a test $\phi$ is of level $\alpha$ if

$$ \sup_{\theta \in \Theta_{0}} \mathbb{P}_{\theta}(\phi = 1) \le \alpha . $$


Example 6.1.7 (One-Sided Binomial Test). Consider again the situation from
Example 6.1.1 (an i.i.d. sample $Y = (Y_{1}, \ldots, Y_{n})^{\top}$ from a Bernoulli distribution
is observed). We let $n = 13$ and $\Theta_{0} = [0, 1/5]$. For instance, one may want to test if
the cancer-related mortality in a subpopulation of individuals which are exposed
to some environmental risk factor is not significantly larger than in the general
population, in which it is equal to 1/5. To this end, death causes are assessed for
13 decedents and for every decedent we get the information if s/he died because of
cancer or not. From Sect. 2.6, we know that $S_{n}/n$ efficiently estimates the success
probability $\theta^{*}$, where $S_{n} = \sum_{i=1}^{n} Y_{i}$. Therefore, it appears natural to use $S_{n}$ also
as the basis for a solution of the test problem. Under $\theta^{*}$, it holds $S_{n} \sim \mathrm{Bin}(n, \theta^{*})$.
Since a large value of $S_{n}$ implies evidence against the null, we choose a test of the
form $\phi = \mathbf{1}_{(c_{\alpha}, n]}(S_{n})$, where the constant $c_{\alpha}$ has to be chosen such that $\phi$ is of level
$\alpha$. This condition can equivalently be expressed as

$$ \inf_{0 \le \theta \le 1/5} \mathbb{P}_{\theta}(S_{n} \le c_{\alpha}) \ge 1 - \alpha . $$

For fixed $k \in \{0, \ldots, n\}$, we have

$$ \mathbb{P}_{\theta}(S_{n} \le k) = \sum_{\ell=0}^{k} \binom{n}{\ell} \theta^{\ell} (1 - \theta)^{n - \ell} = F(\theta, k) \quad \text{(say)} . $$


Exercise 6.1.1. Show that for all $k \in \{0, \ldots, n\}$, the function $F(\cdot, k)$ is decreasing
on $\Theta_{0} = [0, 1/5]$.


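Due to Exercise 6.1.1, we have to calculate $c_{\alpha}$ under the least favorable
parameter configuration (LFC) $\theta = 1/5$ and we obtain

$$ c_{\alpha} = \min\left\{ k \in \{0, \ldots, n\} : \sum_{\ell=0}^{k} \binom{n}{\ell} \left( \frac{1}{5} \right)^{\ell} \left( \frac{4}{5} \right)^{n - \ell} \ge 1 - \alpha \right\} , $$

because we want to exhaust the significance level $\alpha$ as tightly as possible. For the
standard choice of $\alpha = 0.05$, we have

$$ \sum_{\ell=0}^{4} \binom{13}{\ell} \left( \frac{1}{5} \right)^{\ell} \left( \frac{4}{5} \right)^{13 - \ell} \approx 0.901 , \qquad
   \sum_{\ell=0}^{5} \binom{13}{\ell} \left( \frac{1}{5} \right)^{\ell} \left( \frac{4}{5} \right)^{13 - \ell} \approx 0.9700 , $$

hence, we choose $c_{\alpha} = 5$. However, the size of the test $\phi = \mathbf{1}_{(5, 13]}(S_{n})$ is strictly
smaller than the significance level $\alpha = 0.05$, namely

$$ \sup_{0 \le \theta \le 1/5} \mathbb{P}_{\theta}(S_{n} > 5) = \mathbb{P}_{1/5}(S_{n} > 5) = 1 - \mathbb{P}_{1/5}(S_{n} \le 5) \approx 0.03 < \alpha . $$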
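
The computation in Example 6.1.7 can be retraced numerically. The following is a minimal sketch using only the binomial formula for $F(\theta, k)$; the helper names `binom_cdf` and `critical_value` are ours and are introduced for illustration.

```python
from math import comb

def binom_cdf(k, n, theta):
    """F(theta, k) = P_theta(S_n <= k) for S_n ~ Bin(n, theta)."""
    return sum(comb(n, l) * theta**l * (1 - theta)**(n - l) for l in range(k + 1))

def critical_value(n, theta_lfc, alpha):
    """Smallest k such that P_theta_lfc(S_n <= k) >= 1 - alpha."""
    for k in range(n + 1):
        if binom_cdf(k, n, theta_lfc) >= 1 - alpha:
            return k
    return n

n, theta_lfc, alpha = 13, 0.2, 0.05
c_alpha = critical_value(n, theta_lfc, alpha)
size = 1 - binom_cdf(c_alpha, n, theta_lfc)   # rejection probability at the LFC
print("critical value:", c_alpha)
print("F(1/5, 4) =", round(binom_cdf(4, n, theta_lfc), 4))
print("F(1/5, 5) =", round(binom_cdf(5, n, theta_lfc), 4))
print("size of the test:", round(size, 4))
```

The gap between the size and the level $\alpha$ comes from the discreteness of $S_{n}$: no integer threshold gives a rejection probability of exactly 0.05 at $\theta = 1/5$.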


6.1.5 Randomized Tests 


In some situations it is difficult to decide about acceptance or rejection of the
hypothesis. A randomized test can be viewed as a weighted decision: with a certain
probability the hypothesis is rejected, otherwise retained. The decision space for a
randomized test $\phi$ is the unit interval $[0, 1]$, that is, $\phi(Y)$ is a number between zero
and one. The hypothesis $H_{0}$ is rejected with probability $\phi(Y)$ on the basis of the
observed data $Y$. If $\phi(Y)$ only admits the binary values 0 and 1 for every $Y$, then
we are back at the usual non-randomized test that we have considered before. The
probability of an error of the first kind is naturally given by the value $\mathbb{E}\phi(Y)$. For a
simple hypothesis $H_{0} : \mathbb{P} = \mathbb{P}_{\theta^{*}}$, a test $\phi$ is now of level $\alpha$ if

$$ \mathbb{E}\phi(Y) = \alpha . $$


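For a randomized test $\phi$, the significance level $\alpha$ is typically attainable exactly, even
for discrete models. In the case of a composite hypothesis $H_{0} : \mathbb{P} \in (\mathbb{P}_{\theta}, \theta \in \Theta_{0})$,
the level condition reads as

$$ \sup_{\theta \in \Theta_{0}} \mathbb{E}_{\theta}\phi(Y) \le \alpha $$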


as before. In what follows we mostly consider non-randomized tests and only 
comment on whether a randomization can be useful. Note that any randomized test 
can be reduced to a non-randomized test by extending the probability space. 


Exercise 6.1.2. Construct for any randomized test $\phi$ its non-randomized version
using a random data generator.


Randomized tests are a satisfactory solution of the test problem from a mathe- 
matical point of view, but they are disliked by practitioners, because the test result 
may not be reproducible, due to randomization. 


Example 6.1.8 (Example 6.1.7 Continued). Under the setup of Example 6.1.7,
consider the randomized test $\phi$, given by

$$ \phi(Y) = \begin{cases} 0, & S_{n} < 5, \\ 2/7, & S_{n} = 5, \\ 1, & S_{n} > 5 . \end{cases} $$

It is easy to show that under the LFC $\theta = 1/5$, the size of $\phi$ is (up to rounding)
exactly equal to $\alpha = 5\%$.
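
The claim of Example 6.1.8 can be checked with a few lines of code. This is a sketch under the setup of Example 6.1.7 ($n = 13$, LFC $\theta = 1/5$, threshold 5); the function names are ours. It evaluates the size for the randomization weight $2/7$ and also the weight that would make the size exactly $\alpha$.

```python
from math import comb

def binom_pmf(k, n, theta):
    """P_theta(S_n = k) for S_n ~ Bin(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def upper_tail(n, theta, c):
    """P_theta(S_n > c)."""
    return sum(binom_pmf(k, n, theta) for k in range(c + 1, n + 1))

def randomized_size(n, theta, c, gamma):
    """E_theta phi(Y) for phi = 1 if S_n > c, gamma if S_n = c, 0 otherwise."""
    return upper_tail(n, theta, c) + gamma * binom_pmf(c, n, theta)

def exact_gamma(n, theta, c, alpha):
    """Randomization weight for which the size equals alpha exactly."""
    return (alpha - upper_tail(n, theta, c)) / binom_pmf(c, n, theta)

size_2_7 = randomized_size(13, 0.2, 5, 2 / 7)
gamma_star = exact_gamma(13, 0.2, 5, 0.05)
print(f"size with weight 2/7: {size_2_7:.5f}")
print(f"weight giving size exactly 0.05: {gamma_star:.5f}")
```

The two weights differ only slightly, which is what "up to rounding" refers to.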


6.1.6 Alternative Hypotheses, Error of the Second Kind, Power 
of a Test 


The setup of hypothesis testing is asymmetric in the sense that it focuses on
the null hypothesis. However, for a complete analysis, one has to specify the
data distribution when the hypothesis is false. Within the parametric framework,
one usually makes the assumption that the unknown data distribution belongs to
some parametric family $(\mathbb{P}_{\theta}, \theta \in \Theta \subseteq \mathbb{R}^{p})$. This assumption has to be fulfilled
independently of whether the hypothesis is true or false. In other words, we assume
that $\mathbb{P} \in (\mathbb{P}_{\theta}, \theta \in \Theta)$ and there is a subset $\Theta_{0} \subseteq \Theta$ corresponding to the null
hypothesis. The measure $\mathbb{P} = \mathbb{P}_{\theta}$ for $\theta \notin \Theta_{0}$ is called an alternative. Furthermore,
we call $\Theta_{1} = \Theta \setminus \Theta_{0}$ the alternative hypothesis.


Now we can consider the performance of a test $\phi$ when the hypothesis $H_{0}$ is
false. The decision to retain the hypothesis when it is false is called the error of the
second kind. The probability of such an error is equal to $\mathbb{P}(\phi = 0)$, whenever $\mathbb{P}$ is
an alternative. This value certainly depends on the alternative $\mathbb{P} = \mathbb{P}_{\theta}$ for $\theta \notin \Theta_{0}$.
The value $\beta(\theta) = 1 - \mathbb{P}_{\theta}(\phi = 0)$ is often called the test power at $\theta \notin \Theta_{0}$. The
function $\beta(\theta)$ of $\theta \in \Theta \setminus \Theta_{0}$ given by

$$ \beta(\theta) \stackrel{\mathrm{def}}{=} 1 - \mathbb{P}_{\theta}(\phi = 0) $$

is called power function. Ideally one would desire to build a test which simultaneously
and separately minimizes the size and maximizes the power. These two
wishes are somehow contradictory. A decrease of the size usually results in a
decrease of the power and vice versa. Usually one imposes the level $\alpha$ constraint
on the size of the test and tries to optimize its power under this constraint. Under
the general framework of statistical decision problems as discussed in Sect. 1.4,
one can thus regard $R(\phi, \theta) = 1 - \beta(\theta)$, $\theta \in \Theta_{1}$, as a risk function. If we
agree on this risk measure and restrict attention to level $\alpha$ tests, then the test
problem, regarded as a statistical decision problem, is already completely specified
by $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}), (\mathbb{P}_{\theta} : \theta \in \Theta), \Theta_{0})$.
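
To make the notion of a power function concrete, the following sketch evaluates $\beta(\theta) = 1 - \mathbb{P}_{\theta}(\phi = 0)$ for the non-randomized binomial test of Example 6.1.7, that is, $\phi = \mathbf{1}(S_{n} > 5)$ with $n = 13$. The function name `power` and the grid of alternatives are chosen here for illustration.

```python
from math import comb

def power(theta, n=13, c=5):
    """beta(theta) = P_theta(S_n > c) for the test phi = 1(S_n > c), S_n ~ Bin(n, theta)."""
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(c + 1, n + 1))

# At theta = 0.2 (boundary of the null) the value is the size of the test;
# for theta > 0.2 it is the power at the corresponding alternative.
for theta in (0.2, 0.3, 0.5, 0.7):
    print(f"theta = {theta:.1f}: rejection probability = {power(theta):.4f}")
```

Lowering the threshold `c` would increase the power at every alternative, but also the size, which illustrates the trade-off described above.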


Definition 6.1.1. A test $\phi^{*}$ is called uniformly most powerful (UMP) test of level $\alpha$
if it is of level $\alpha$ and for any other test $\phi$ of level $\alpha$, it holds

$$ 1 - \mathbb{P}_{\theta}(\phi^{*} = 0) \ge 1 - \mathbb{P}_{\theta}(\phi = 0), \qquad \theta \notin \Theta_{0} . $$

Unfortunately, such UMP tests exist only in very few special models; otherwise,
optimization of the power given the level is a complicated task.
In the case of a univariate parameter $\theta \in \Theta \subseteq \mathbb{R}^{1}$ and a simple hypothesis
$\theta = \theta^{*}$, one often considers one-sided alternatives

$$ H_{1} : \theta > \theta^{*} \qquad \text{or} \qquad H_{1} : \theta < \theta^{*} $$

or a two-sided alternative

$$ H_{1} : \theta \ne \theta^{*} . $$


6.2 Neyman-Pearson Test for Two Simple Hypotheses


This section discusses one very special case of hypothesis testing when both the 
null hypothesis and the alternative are simple one-point sets. This special situation 
by itself can be viewed as a toy problem, but it is very important from the 
methodological point of view. In particular, it introduces and justifies the so-called 
likelihood ratio test and demonstrates its efficiency. 

For simplicity we write $\mathbb{P}_{0}$ for the measure corresponding to the null hypothesis
and $\mathbb{P}_{1}$ for the alternative measure. A test $\phi$ is a measurable function of the
observations with values in the two-point set $\{0, 1\}$. The event $\phi = 0$ is treated
as acceptance of the null hypothesis $H_{0}$ while $\phi = 1$ means rejection of the null
hypothesis and, consequently, decision in favor of $H_{1}$.

For ease of presentation we assume that the measure $\mathbb{P}_{1}$ is absolutely continuous
w.r.t. the measure $\mathbb{P}_{0}$ and denote by $Z(Y)$ the corresponding derivative at the
observation point:

$$ Z(Y) \stackrel{\mathrm{def}}{=} \frac{d\mathbb{P}_{1}}{d\mathbb{P}_{0}}(Y) . $$

Similarly $L(Y)$ means the log-density:

$$ L(Y) \stackrel{\mathrm{def}}{=} \log Z(Y) = \log \frac{d\mathbb{P}_{1}}{d\mathbb{P}_{0}}(Y) . $$


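The solution of the test problem in the case of two simple hypotheses is known as
the Neyman-Pearson test: reject the hypothesis $H_{0}$ if the log-likelihood ratio $L(Y)$
exceeds a specific critical value $\mathfrak{z}$:

$$ \phi^{*}_{\mathfrak{z}} \stackrel{\mathrm{def}}{=} \mathbf{1}\bigl( L(Y) \ge \mathfrak{z} \bigr) = \mathbf{1}\bigl( Z(Y) \ge e^{\mathfrak{z}} \bigr) . $$

The Neyman-Pearson test is known as the one minimizing the weighted sum of the
errors of the first and second kind. For a non-randomized test $\phi$ this sum is equal to

$$ \wp_{0}\, \mathbb{P}_{0}(\phi = 1) + \wp_{1}\, \mathbb{P}_{1}(\phi = 0) , $$

while the weighted error of a randomized test $\phi$ is

$$ \wp_{0}\, \mathbb{E}_{0}\phi + \wp_{1}\, \mathbb{E}_{1}(1 - \phi) . \tag{6.1} $$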


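Theorem 6.2.1. For every two positive values $\wp_{0}$ and $\wp_{1}$, the test $\phi^{*}_{\mathfrak{z}}$ with $\mathfrak{z} =
\log(\wp_{0}/\wp_{1})$ minimizes (6.1) over all possible (randomized) tests $\phi$:

$$ \phi^{*}_{\mathfrak{z}} \stackrel{\mathrm{def}}{=} \mathbf{1}\bigl( L(Y) \ge \mathfrak{z} \bigr)
   = \operatorname*{argmin}_{\phi} \bigl\{ \wp_{0}\, \mathbb{E}_{0}\phi + \wp_{1}\, \mathbb{E}_{1}(1 - \phi) \bigr\} . $$

Proof. We use the formula for a change of measure:

$$ \mathbb{E}_{1}\xi = \mathbb{E}_{0}[\xi Z(Y)] $$

for any r.v. $\xi$. It holds for any test $\phi$ with $\mathfrak{z} = \log(\wp_{0}/\wp_{1})$

$$ \begin{aligned}
\wp_{0}\, \mathbb{E}_{0}\phi + \wp_{1}\, \mathbb{E}_{1}(1 - \phi)
 &= \mathbb{E}_{0}\bigl[ \wp_{0}\phi - \wp_{1} Z(Y)\phi \bigr] + \wp_{1} \\
 &= -\wp_{1}\, \mathbb{E}_{0}\bigl[ (Z(Y) - e^{\mathfrak{z}})\phi \bigr] + \wp_{1} \\
 &\ge -\wp_{1}\, \mathbb{E}_{0}\bigl[ Z(Y) - e^{\mathfrak{z}} \bigr]_{+} + \wp_{1}
\end{aligned} $$

with equality for $\phi = \mathbf{1}(L(Y) \ge \mathfrak{z})$.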
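
Theorem 6.2.1 can be checked by brute force on a small discrete problem. The sketch below uses a hypothetical four-point sample space with made-up probability vectors `p0`, `p1` and weights `w0`, `w1` (playing the role of $\wp_{0}$, $\wp_{1}$). Since the weighted error (6.1) is linear in $\phi$, it is enough to enumerate the non-randomized tests.

```python
from itertools import product

# Hypothetical testing problem on a four-point sample space.
p0 = [0.4, 0.3, 0.2, 0.1]   # probabilities under the null P_0
p1 = [0.1, 0.2, 0.3, 0.4]   # probabilities under the alternative P_1
w0, w1 = 1.0, 2.0           # weights of the first- and second-kind errors

def weighted_error(phi):
    """w0 * E_0 phi + w1 * E_1 (1 - phi) for a test phi given as a 0/1 sequence."""
    err0 = sum(p * f for p, f in zip(p0, phi))
    err1 = sum(p * (1 - f) for p, f in zip(p1, phi))
    return w0 * err0 + w1 * err1

# Neyman-Pearson test: reject when Z = p1/p0 >= w0/w1, i.e. L >= log(w0/w1).
np_test = [1 if q1 / q0 >= w0 / w1 else 0 for q0, q1 in zip(p0, p1)]

# Exhaustive search over all 2^4 non-randomized tests.
best = min(product([0, 1], repeat=len(p0)), key=weighted_error)

print("Neyman-Pearson test:", np_test, "weighted error:", round(weighted_error(np_test), 6))
print("best test by search:", list(best), "weighted error:", round(weighted_error(best), 6))
```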


The Neyman-Pearson test belongs to a large class of tests of the form

$$ \phi = \mathbf{1}(T \ge \mathfrak{z}) , $$

where $T$ is a function of the observations $Y$. This random variable is usually called
a test statistic while the threshold $\mathfrak{z}$ is called a critical value. The hypothesis is
rejected if the test statistic exceeds the critical value. For the Neyman-Pearson test,
the test statistic is the log-likelihood ratio $L(Y)$ and the critical value is selected as
a suitable quantile of this test statistic.

The next result shows that the Neyman-Pearson test $\phi^{*}_{\mathfrak{z}}$ with a proper critical
value $\mathfrak{z}$ can be constructed to maximize the power $\mathbb{E}_{1}\phi$ under the level constraint
$\mathbb{E}_{0}\phi \le \alpha$.


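Theorem 6.2.2. Given $\alpha \in (0, 1)$, let $\mathfrak{z}_{\alpha}$ be such that

$$ \mathbb{P}_{0}\bigl( L(Y) \ge \mathfrak{z}_{\alpha} \bigr) = \alpha . \tag{6.2} $$

Then it holds

$$ \phi^{*}_{\mathfrak{z}_{\alpha}} \stackrel{\mathrm{def}}{=} \mathbf{1}\bigl( L(Y) \ge \mathfrak{z}_{\alpha} \bigr)
   = \operatorname*{argmax}_{\phi :\, \mathbb{E}_{0}\phi \le \alpha} \bigl\{ \mathbb{E}_{1}\phi \bigr\} . $$


Proof. Let $\phi$ satisfy $\mathbb{E}_{0}\phi \le \alpha$. Then

$$ \begin{aligned}
\mathbb{E}_{1}\phi - \alpha e^{\mathfrak{z}_{\alpha}}
 &\le \mathbb{E}_{0}\{ Z(Y)\phi \} - e^{\mathfrak{z}_{\alpha}}\, \mathbb{E}_{0}\phi \\
 &= \mathbb{E}_{0}\bigl[ \{ Z(Y) - e^{\mathfrak{z}_{\alpha}} \}\phi \bigr] \\
 &\le \mathbb{E}_{0}\bigl[ Z(Y) - e^{\mathfrak{z}_{\alpha}} \bigr]_{+} ,
\end{aligned} $$

again with equality for $\phi = \mathbf{1}(L(Y) \ge \mathfrak{z}_{\alpha})$.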


The previous result assumes that for a given $\alpha$ there is a critical value $\mathfrak{z}_{\alpha}$ such
that (6.2) is fulfilled. However, this is not always the case.

Exercise 6.2.1. Let $L(Y) = \log d\mathbb{P}_{1}(Y)/d\mathbb{P}_{0}$.

• Show that the relation (6.2) can always be fulfilled with a proper choice of $\mathfrak{z}_{\alpha}$ if
  the pdf of $L(Y)$ under $\mathbb{P}_{0}$ is a continuous function.
• Suppose that the pdf of $L(Y)$ is discontinuous and $\mathfrak{z}_{\alpha}$ fulfills

$$ \mathbb{P}_{0}\bigl( L(Y) \ge \mathfrak{z}_{\alpha} \bigr) > \alpha, \qquad \mathbb{P}_{0}\bigl( L(Y) \le \mathfrak{z}_{\alpha} \bigr) > 1 - \alpha . $$

  Construct a randomized test that fulfills $\mathbb{E}_{0}\phi = \alpha$ and maximizes the test power
  $\mathbb{E}_{1}\phi$ among all such tests.


The Neyman-Pearson test can be viewed as a special case of the general
likelihood ratio test. Indeed, it decides in favor of the null or the alternative by
looking at the likelihood ratio. Informally one can say: we decide in favor of the
alternative if it is significantly more likely at the point of observation $Y$.

An interesting question that arises in relation with the Neyman-Pearson result is
how to interpret it when the true distribution $\mathbb{P}$ does not coincide either with $\mathbb{P}_{0}$ or
with $\mathbb{P}_{1}$ and probably it is not even within the considered parametric family $(\mathbb{P}_{\theta})$.
Wald called this situation the third-kind error. It is worth mentioning that the test $\phi^{*}$
remains meaningful: it decides which of two given measures $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$ describes
the given data better. However, it is not any more a likelihood ratio test. In analogy
with estimation theory, one can call it a quasi likelihood ratio test.


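6.2.1 Neyman-Pearson Test for an i.i.d. Sample


Let $Y = (Y_{1}, \ldots, Y_{n})^{\top}$ be an i.i.d. sample from a measure $P$. Suppose that $P$
belongs to some parametric family $(P_{\theta}, \theta \in \Theta \subseteq \mathbb{R}^{p})$, that is, $P = P_{\theta^{*}}$ for
$\theta^{*} \in \Theta$. Let also a special point $\theta_{0}$ be fixed. A simple null hypothesis can be
formulated as $\theta^{*} = \theta_{0}$. Similarly, a simple alternative is $\theta^{*} = \theta_{1}$ for some other
point $\theta_{1} \in \Theta$. The Neyman-Pearson test situation is a bit artificial: one reduces the
whole parameter set $\Theta$ to just these two points $\theta_{0}$ and $\theta_{1}$ and tests $\theta_{0}$ against $\theta_{1}$.

As usual, the distribution of the data $Y$ is described by the product measure
$\mathbb{P}_{\theta} = P_{\theta}^{\otimes n}$. If $\mu_{0}$ is a dominating measure for $(P_{\theta})$ and
$\ell(y, \theta) \stackrel{\mathrm{def}}{=} \log[dP_{\theta}(y)/d\mu_{0}]$, then the log-likelihood $L(Y, \theta)$ is

$$ L(Y, \theta) \stackrel{\mathrm{def}}{=} \log \frac{d\mathbb{P}_{\theta}}{d\boldsymbol{\mu}_{0}}(Y) = \sum_{i=1}^{n} \ell(Y_{i}, \theta) , $$

where $\boldsymbol{\mu}_{0} = \mu_{0}^{\otimes n}$. The log-likelihood ratio of $\mathbb{P}_{\theta_{1}}$ w.r.t. $\mathbb{P}_{\theta_{0}}$ can be defined as

$$ L(Y, \theta_{1}, \theta_{0}) \stackrel{\mathrm{def}}{=} L(Y, \theta_{1}) - L(Y, \theta_{0}) . $$

The related Neyman-Pearson test can be written as

$$ \phi^{*}_{\mathfrak{z}} \stackrel{\mathrm{def}}{=} \mathbf{1}\bigl( L(Y, \theta_{1}, \theta_{0}) \ge \mathfrak{z} \bigr) . $$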


6.3 Likelihood Ratio Test 


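This section introduces a general likelihood ratio test in the framework of parametric
testing theory. Let, as usual, $Y$ be the observed data, and $\mathbb{P}$ be their distribution. The
parametric assumption is that $\mathbb{P} \in (\mathbb{P}_{\theta}, \theta \in \Theta)$, that is, $\mathbb{P} = \mathbb{P}_{\theta^{*}}$ for $\theta^{*} \in \Theta$. Let
now two subsets $\Theta_{0}$ and $\Theta_{1}$ of the set $\Theta$ be given. The hypothesis $H_{0}$ that we would
like to test is that $\mathbb{P} \in (\mathbb{P}_{\theta}, \theta \in \Theta_{0})$, or equivalently, $\theta^{*} \in \Theta_{0}$. The alternative is
that $\theta^{*} \in \Theta_{1}$.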

The general likelihood approach leads to comparing the (maximum) likelihood
values $L(Y, \theta)$ on the hypothesis and alternative sets. Namely, the hypothesis is
rejected if there is one alternative point $\theta_{1} \in \Theta_{1}$ such that the value $L(Y, \theta_{1})$
exceeds all corresponding values for $\theta \in \Theta_{0}$ by a certain amount which is defined
by assigning losses or by fixing a significance level. In other words, rejection takes
place if observing the sample $Y$ under the alternative $\mathbb{P}_{\theta_{1}}$ is significantly more likely
than under any measure $\mathbb{P}_{\theta}$ from the null. Formally this relation can be written as:

$$ \sup_{\theta \in \Theta_{0}} L(Y, \theta) + \mathfrak{z} < \sup_{\theta \in \Theta_{1}} L(Y, \theta) , $$

where the constant $\mathfrak{z}$ makes the term "significantly" explicit. In particular, a simple
hypothesis means that the set $\Theta_{0}$ consists of one single point $\theta_{0}$ and the latter
relation takes the form

$$ L(Y, \theta_{0}) + \mathfrak{z} < \sup_{\theta \in \Theta_{1}} L(Y, \theta) . $$


In general, the likelihood ratio (LR) test corresponds to the test statistic

$$ T \stackrel{\mathrm{def}}{=} \sup_{\theta \in \Theta_{1}} L(Y, \theta) - \sup_{\theta \in \Theta_{0}} L(Y, \theta) . \tag{6.3} $$

The hypothesis is rejected if this test statistic exceeds some critical value $\mathfrak{z}$. Usually
this critical value is selected to ensure the level condition:

$$ \mathbb{P}(T > \mathfrak{z}_{\alpha}) \le \alpha $$

for a given level $\alpha$, whenever $\mathbb{P}$ is a measure under the null hypothesis.
We have already seen that the LR test is optimal for testing a simple hypothesis
against a simple alternative. Later we show that this optimality property can be
extended to some more general situations. Now and in the following Sect. 6.4 we
consider further examples of an LR test.


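Example 6.3.1 (Chi-Square Test for Goodness-of-Fit). Let the observation space
(which is a subset of $\mathbb{R}^{1}$) be split into non-overlapping subsets $A_{1}, \ldots, A_{d}$ and
assume one observes indicators $\mathbf{1}(Y_{i} \in A_{j})$ for $1 \le i \le n$ and $1 \le j \le d$.
Define $\theta_{j} = P_{\theta}(A_{j}) = \int_{A_{j}} P_{\theta}(dy)$ for $1 \le j \le d$. Let counting variables $N_{j}$,
$1 \le j \le d$, be given by $N_{j} = \sum_{i=1}^{n} \mathbf{1}(Y_{i} \in A_{j})$. The vector $N = (N_{1}, \ldots, N_{d})^{\top}$
follows the multinomial distribution with parameters $n$, $d$, and $\theta = (\theta_{1}, \ldots, \theta_{d})^{\top}$,
where we assume $n$ and $d$ as fixed, leading to $\dim(\Theta) = d - 1$. More specifically,
it holds

$$ \Theta = \Bigl\{ \theta = (\theta_{1}, \ldots, \theta_{d})^{\top} \in [0, 1]^{d} : \sum_{j=1}^{d} \theta_{j} = 1 \Bigr\} . $$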


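The likelihood statistic for this model with respect to the counting measure is given
by

$$ Z(N, \theta) = \frac{n!}{\prod_{j=1}^{d} N_{j}!} \prod_{j=1}^{d} \theta_{j}^{N_{j}} , $$

and the MLE is given by $\tilde{\theta}_{j} = N_{j}/n$, $1 \le j \le d$. Now, consider the point
hypothesis $\theta = p$ for a fixed given vector $p \in \Theta$. We obtain the likelihood ratio
statistic

$$ T = \sup_{\theta \in \Theta} \log \frac{Z(N, \theta)}{Z(N, p)} , $$

leading to

$$ T = n \sum_{j=1}^{d} \tilde{\theta}_{j} \log \frac{\tilde{\theta}_{j}}{p_{j}} . $$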


In practice, this LR test is often carried out as Pearson's chi-square test. To this end,
consider the function $h : \mathbb{R} \to \mathbb{R}$, given by $h(x) = x \log(x/x_{0})$ for a fixed real
number $x_{0} \in (0, 1)$. Then, the Taylor expansion of $h(x)$ around $x_{0}$ is given by

$$ h(x) = (x - x_{0}) + \frac{1}{2x_{0}} (x - x_{0})^{2} + o\bigl[ (x - x_{0})^{2} \bigr] \qquad \text{as } x \to x_{0} . $$

Consequently, for $\tilde{\theta}$ close to $p$, the use of $\sum_{j} \tilde{\theta}_{j} = \sum_{j} p_{j} = 1$ implies

$$ 2T \approx \sum_{j=1}^{d} \frac{(N_{j} - n p_{j})^{2}}{n p_{j}} $$

in probability under the null hypothesis. The statistic $Q$, given by

$$ Q \stackrel{\mathrm{def}}{=} \sum_{j=1}^{d} \frac{(N_{j} - n p_{j})^{2}}{n p_{j}} , $$

is called Pearson's chi-square statistic.
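
The approximation $2T \approx Q$ can be observed numerically. The sketch below computes both statistics for a made-up vector of cell counts and a uniform null vector $p$; the data and the function names are illustrative assumptions.

```python
from math import log

def lr_statistic(counts, p):
    """T = n * sum_j theta_j * log(theta_j / p_j) with theta_j = N_j / n.

    Cells with N_j = 0 contribute zero (convention 0 * log 0 = 0).
    """
    n = sum(counts)
    return sum(c * log((c / n) / pj) for c, pj in zip(counts, p) if c > 0)

def pearson_statistic(counts, p):
    """Q = sum_j (N_j - n p_j)^2 / (n p_j)."""
    n = sum(counts)
    return sum((c - n * pj) ** 2 / (n * pj) for c, pj in zip(counts, p))

counts = [18, 22, 27, 33]          # made-up cell counts, n = 100
p = [0.25, 0.25, 0.25, 0.25]       # hypothesized cell probabilities

T = lr_statistic(counts, p)
Q = pearson_statistic(counts, p)
print(f"2T = {2 * T:.4f}, Q = {Q:.4f}")
```

The agreement is typically good when the observed frequencies are close to $p$ and deteriorates when some cells deviate strongly, since the Taylor expansion is local.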


6.4 Likelihood Ratio Tests for Parameters of a Normal 
Distribution 


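For all examples considered in this section, we assume that the data $Y$ in form of an
i.i.d. sample $(Y_{1}, \ldots, Y_{n})^{\top}$ follow the model $Y_{i} = \theta^{*} + \varepsilon_{i}$ with $\varepsilon_{i} \sim N(0, \sigma^{2})$ for
$\sigma^{2}$ known or unknown. Equivalently $Y_{i} \sim N(\theta^{*}, \sigma^{2})$. The log-likelihood $L(Y, \theta)$
(which we also denote by $L(\theta)$) reads as

$$ L(\theta) = -\frac{n}{2} \log(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (Y_{i} - \theta)^{2} \tag{6.4} $$

and the log-likelihood ratio $L(\theta, \theta_{0}) = L(\theta) - L(\theta_{0})$ is given by

$$ L(\theta, \theta_{0}) = \sigma^{-2} \bigl[ (S - n\theta_{0})(\theta - \theta_{0}) - n(\theta - \theta_{0})^{2}/2 \bigr] \tag{6.5} $$

with $S \stackrel{\mathrm{def}}{=} Y_{1} + \ldots + Y_{n}$.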
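
As a quick sanity check of (6.5), one can compare the difference of two values of (6.4) with the closed-form expression. The sample values below are made up and the function names are ours; this is only a numerical illustration of the algebra.

```python
from math import log, pi

def log_lik(sample, theta, sigma2):
    """Gaussian log-likelihood (6.4) with known variance sigma2."""
    n = len(sample)
    return -0.5 * n * log(2 * pi * sigma2) - sum((y - theta) ** 2 for y in sample) / (2 * sigma2)

def log_lik_ratio(sample, theta, theta0, sigma2):
    """Closed form (6.5): [(S - n theta0)(theta - theta0) - n (theta - theta0)^2 / 2] / sigma2."""
    n, s = len(sample), sum(sample)
    return ((s - n * theta0) * (theta - theta0) - n * (theta - theta0) ** 2 / 2) / sigma2

sample = [0.3, -1.2, 2.1, 0.8, 1.5]   # made-up observations
direct = log_lik(sample, 1.0, 2.0) - log_lik(sample, 0.0, 2.0)
closed = log_lik_ratio(sample, 1.0, 0.0, 2.0)
print(direct, closed)
```

Note that the ratio depends on the data only through the sum $S$, which is why all tests in this section are functions of $S$ (or of $S/n$).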


6.4.1 Distributions Related to an i.i.d. Sample from a Normal 
Distribution 


As a preparation for the subsequent sections, we introduce here some important
probability distributions which correspond to functions of an i.i.d. sample $Y =
(Y_{1}, \ldots, Y_{n})^{\top}$ from a normal distribution.


Lemma 6.4.1. If $Y$ follows the standard normal distribution on $\mathbb{R}$, then $Y^{2}$ has the
gamma distribution $\Gamma(1/2, 1/2)$.


Exercise 6.4.1. Prove Lemma 6.4.1.


Corollary 6.4.1. Let $Y = (Y_{1}, \ldots, Y_{n})^{\top}$ denote an i.i.d. sample from the standard
normal distribution on $\mathbb{R}$. Then, it holds that

$$ \sum_{i=1}^{n} Y_{i}^{2} \sim \Gamma(1/2, n/2) . $$

We call $\Gamma(1/2, n/2)$ the chi-square distribution with $n$ degrees of freedom, $\chi^{2}_{n}$ for
short.


Proof. From Lemma 6.4.1, we have that $Y_{1}^{2} \sim \Gamma(1/2, 1/2)$. Convolution stability
of the family of gamma distributions with respect to the second parameter yields the
assertion.


Lemma 6.4.2. Let $\alpha, r, s > 0$ be positive constants and $X, Y$ independent random
variables with $X \sim \Gamma(\alpha, r)$ and $Y \sim \Gamma(\alpha, s)$. Then $S = X + Y$ and
$R = X/(X + Y)$ are independent with $S \sim \Gamma(\alpha, r + s)$ and $R \sim \mathrm{Beta}(r, s)$.


Exercise 6.4.2. Prove Lemma 6.4.2. 


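Theorem 6.4.1. Let $X_{1}, \ldots, X_{m}, Y_{1}, \ldots, Y_{n}$ be i.i.d. with $X_{1}$ following the standard
normal distribution on $\mathbb{R}$. Then, the ratio

$$ F_{m,n} \stackrel{\mathrm{def}}{=} \left( m^{-1} \sum_{i=1}^{m} X_{i}^{2} \right) \Big/ \left( n^{-1} \sum_{j=1}^{n} Y_{j}^{2} \right) $$

has the following pdf with respect to the Lebesgue measure:

$$ f_{m,n}(x) = \frac{m^{m/2}\, n^{n/2}}{B(m/2, n/2)} \, \frac{x^{m/2 - 1}}{(n + mx)^{(m+n)/2}} \, \mathbf{1}_{(0, \infty)}(x) . $$

The distribution of $F_{m,n}$ is called Fisher's $F$-distribution with $m$ and $n$ degrees of
freedom (Sir R. A. Fisher, 1890-1962).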


Exercise 6.4.3. Prove Theorem 6.4.1. 


Corollary 6.4.2. Let $X, Y_{1}, \ldots, Y_{n}$ be i.i.d. with $X$ following the standard normal
distribution on $\mathbb{R}$. Then, the statistic

$$ T = \frac{X}{\sqrt{n^{-1} \sum_{j=1}^{n} Y_{j}^{2}}} $$

has the Lebesgue density

$$ t \mapsto \tau_{n}(t) = \left( 1 + \frac{t^{2}}{n} \right)^{-\frac{n+1}{2}} \bigl\{ \sqrt{n}\, B(1/2, n/2) \bigr\}^{-1} . $$

The distribution of $T$ is called Student's $t$-distribution with $n$ degrees of freedom, $t_{n}$
for short.

Proof. According to Theorem 6.4.1, $T^{2} \sim F_{1,n}$. Thus, due to the transformation
formula for densities, $|T| = \sqrt{T^{2}}$ has Lebesgue density $t \mapsto 2t f_{1,n}(t^{2})$, $t > 0$.
Because of the symmetry of the standard normal density, also the distribution of $T$
is symmetric around 0, i.e., $T$ and $-T$ have the same distribution. Hence, $T$ has the
Lebesgue density $t \mapsto |t| f_{1,n}(t^{2}) = \tau_{n}(t)$.


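Theorem 6.4.2 (Student 1908).
In the Gaussian product model $\bigl( \mathbb{R}^{n}, \mathcal{B}(\mathbb{R}^{n}), (N(\mu, \sigma^{2})^{\otimes n})_{\theta = (\mu, \sigma^{2}) \in \Theta} \bigr)$, where $\Theta =
\mathbb{R} \times (0, \infty)$, it holds for all $\theta \in \Theta$:

(a) The statistics $\bar{Y}_{n} \stackrel{\mathrm{def}}{=} n^{-1} \sum_{j=1}^{n} Y_{j}$ and $\hat{\sigma}^{2} \stackrel{\mathrm{def}}{=} (n - 1)^{-1} \sum_{i=1}^{n} (Y_{i} - \bar{Y}_{n})^{2}$ are
    independent.
(b) $\bar{Y}_{n} \sim N(\mu, \sigma^{2}/n)$ and $(n - 1)\hat{\sigma}^{2}/\sigma^{2} \sim \chi^{2}_{n-1}$.
(c) The statistic $T_{n} \stackrel{\mathrm{def}}{=} \sqrt{n}(\bar{Y}_{n} - \mu)/\hat{\sigma}$ is distributed as $t_{n-1}$.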

6.4.2 Gaussian Shift Model 


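Under the measure $\mathbb{P}_{\theta_{0}}$, the variable $S - n\theta_{0}$ is normal zero-mean with the variance
$n\sigma^{2}$. This particularly implies that $(S - n\theta_{0})/\sqrt{n\sigma^{2}}$ is standard normal under $\mathbb{P}_{\theta_{0}}$:

$$ \mathcal{L}\left( \frac{S - n\theta_{0}}{\sqrt{n\sigma^{2}}} \,\Big|\, \mathbb{P}_{\theta_{0}} \right) = N(0, 1) . $$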


We start with the simplest case of a simple null and simple alternative. 


6.4.2.1 Simple Null and Simple Alternative 
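Let the null $H_{0} : \theta^{*} = \theta_{0}$ be tested against the alternative $H_{1} : \theta^{*} = \theta_{1}$ for some
fixed $\theta_{1} \ne \theta_{0}$. The log-likelihood ratio $L(\theta_{1}, \theta_{0})$ is given by (6.5) leading to the test
statistic

$$ T = \sigma^{-2} \bigl[ (S - n\theta_{0})(\theta_{1} - \theta_{0}) - n(\theta_{1} - \theta_{0})^{2}/2 \bigr] . $$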
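The proper critical value $\mathfrak{z}$ can be selected from the condition of $\alpha$-level:
$\mathbb{P}_{\theta_{0}}(T > \mathfrak{z}_{\alpha}) = \alpha$. We use that the sum $S - n\theta_{0}$ is under the null normal zero-mean with
variance $n\sigma^{2}$, and hence, the random variable

$$ \xi = (S - n\theta_{0}) / \sqrt{n\sigma^{2}} $$

is under $\theta_{0}$ standard normal: $\xi \sim N(0, 1)$. The level condition can be rewritten as

$$ \mathbb{P}_{\theta_{0}}\left( \xi > \frac{1}{\sigma\sqrt{n}\, |\theta_{1} - \theta_{0}|} \bigl[ \sigma^{2} \mathfrak{z}_{\alpha} + n|\theta_{1} - \theta_{0}|^{2}/2 \bigr] \right) = \alpha $$

(written for $\theta_{1} > \theta_{0}$; for $\theta_{1} < \theta_{0}$ the same relation holds with $-\xi$ in place of $\xi$,
which is again standard normal). As $\xi$ is standard normal under $\theta_{0}$, the proper $\mathfrak{z}_{\alpha}$
can be computed as a quantile of the standard normal law: if $z_{\alpha}$ is defined by
$\mathbb{P}_{\theta_{0}}(\xi > z_{\alpha}) = \alpha$, then

$$ \frac{1}{\sigma\sqrt{n}\, |\theta_{1} - \theta_{0}|} \bigl[ \sigma^{2} \mathfrak{z}_{\alpha} + n|\theta_{1} - \theta_{0}|^{2}/2 \bigr] = z_{\alpha} $$

or

$$ \mathfrak{z}_{\alpha} = \sigma^{-2} \bigl[ z_{\alpha} |\theta_{1} - \theta_{0}|\, \sigma\sqrt{n} - n|\theta_{1} - \theta_{0}|^{2}/2 \bigr] . $$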


It is worth noting that this value actually does not depend on $\theta_0$. It only depends on
the difference $|\theta_1 - \theta_0|$ between the null and the alternative. This is a very important
and useful property of the normal family and is called pivotality. Another way of
selecting the critical value $\mathfrak{z}$ is given by minimizing the sum of the first and second-kind
error probabilities. Theorem 6.2.1 leads to the choice $\mathfrak{z} = 0$, or equivalently, to
the test

$$\phi = \begin{cases} 1\{S/n \ge (\theta_0 + \theta_1)/2\} = 1\{\tilde\theta \ge (\theta_0 + \theta_1)/2\}, & \theta_1 > \theta_0, \\ 1\{S/n \le (\theta_0 + \theta_1)/2\} = 1\{\tilde\theta \le (\theta_0 + \theta_1)/2\}, & \theta_1 < \theta_0, \end{cases}$$

where $\tilde\theta = S/n$. This test is also called the Fisher discrimination. It naturally appears in
classification problems.
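
The computation of the critical value for the simple-versus-simple problem can be sketched numerically. The snippet below is only an illustration under the assumptions of this subsection (known $\sigma$, Gaussian shift model); the function names `np_critical_value`, `lr_statistic` and `fisher_discrimination` are hypothetical and not part of the text.

```python
from math import sqrt
from statistics import NormalDist


def np_critical_value(alpha, n, sigma, theta0, theta1):
    """Critical value of the LR statistic T for H0: theta0 against H1: theta1."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # P(xi > z_alpha) = alpha
    d = abs(theta1 - theta0)
    return (z_alpha * sigma * d * sqrt(n) - n * d * d / 2.0) / sigma**2


def lr_statistic(s, n, sigma, theta0, theta1):
    """T = sigma^{-2} [(S - n theta0)(theta1 - theta0) - n (theta1 - theta0)^2 / 2]."""
    d = theta1 - theta0
    return ((s - n * theta0) * d - n * d * d / 2.0) / sigma**2


def fisher_discrimination(s, n, theta0, theta1):
    """Test with critical value 0: compare S/n with the midpoint of theta0 and theta1."""
    mid = (theta0 + theta1) / 2.0
    if theta1 > theta0:
        return int(s / n >= mid)
    return int(s / n <= mid)
```

Calling `np_critical_value` with the pairs `(0.0, 1.0)` and `(3.0, 4.0)` returns the same number, which illustrates the pivotality noted above: only $|\theta_1 - \theta_0|$ enters.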


6.4.2.2. Two-Sided Test 


Now we consider a more general situation when the simple null $\theta^* = \theta_0$ is tested
against the alternative $\theta^* \ne \theta_0$. Then the LR test compares the likelihood at $\theta_0$ with
the maximum likelihood over $\Theta \setminus \{\theta_0\}$, which clearly coincides with the maximum
over the whole parameter set. This leads to the test statistic:

$$T = \max_\theta L(\theta, \theta_0) = \frac{n\sigma^{-2}}{2} |\tilde\theta - \theta_0|^2$$

(see Sect. 2.9), where $\tilde\theta = S/n$ is the MLE. The LR test rejects the null if $T > \mathfrak{z}$
for a critical value $\mathfrak{z}$. The value $\mathfrak{z}$ can be selected from the level condition:

$$P_{\theta_0}(T > \mathfrak{z}) = P_{\theta_0}\bigl(n\sigma^{-2}|\tilde\theta - \theta_0|^2 > 2\mathfrak{z}\bigr) = \alpha.$$

Now we use that $n\sigma^{-2}|\tilde\theta - \theta_0|^2$ is $\chi^2_1$-distributed according to Lemma 6.4.1. If $\mathfrak{z}_\alpha$ is
defined by $P(\xi^2 > 2\mathfrak{z}_\alpha) = \alpha$ for standard normal $\xi$, then the test $\phi = 1(T > \mathfrak{z}_\alpha)$ is
of level $\alpha$. Again, this value does not depend on the null point $\theta_0$, and the LR test is
pivotal.
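
As a small sketch of the level condition above: since $P(\xi^2 > 2\mathfrak{z}_\alpha) = \alpha$ is equivalent to $\sqrt{2\mathfrak{z}_\alpha}$ being the $(1 - \alpha/2)$-quantile of the standard normal law, the critical value can be computed directly. The helper names below are hypothetical, and $\sigma$ is assumed known.

```python
from statistics import NormalDist


def two_sided_critical_value(alpha):
    """Solve P(xi^2 > 2 z) = alpha for z, with xi standard normal."""
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return q * q / 2.0


def two_sided_lr_test(y, theta0, sigma, alpha):
    """Return (decision, T) with T = n |theta_tilde - theta0|^2 / (2 sigma^2)."""
    n = len(y)
    theta_tilde = sum(y) / n  # MLE S/n
    t = n * (theta_tilde - theta0) ** 2 / (2.0 * sigma**2)
    return int(t > two_sided_critical_value(alpha)), t
```

The critical value depends on $\alpha$ only, not on $\theta_0$, $\sigma$ or $n$.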


Exercise 6.4.4. Compute the power function of the resulting two-sided test 


6 = 1(T > 3a). 


6.4.2.3. One-Sided Test 


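Now we consider the problem of testing the null $\theta^* = \theta_0$ against the one-sided
alternative $H_1 : \theta > \theta_0$. To apply the LR test we have to compute the maximum of
the log-likelihood ratio $L(\theta, \theta_0)$ over the set $\Theta_1 = \{\theta > \theta_0\}$.

Exercise 6.4.5. Check that

$$\sup_{\theta > \theta_0} L(\theta, \theta_0) = \begin{cases} n\sigma^{-2}|\tilde\theta - \theta_0|^2/2 & \text{if } \tilde\theta \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases}$$

Hint: if $\tilde\theta \ge \theta_0$, then the supremum over $\Theta_1$ coincides with the global maximum,
otherwise it is attained at the edge $\theta_0$.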


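Now the LR test rejects the null if $\tilde\theta > \theta_0$ and $n\sigma^{-2}|\tilde\theta - \theta_0|^2 > 2\mathfrak{z}$ for a critical
value (CV) $\mathfrak{z}$. That is,

$$\phi = 1\bigl(\tilde\theta - \theta_0 > \sigma\sqrt{2\mathfrak{z}/n}\bigr).$$

The CV $\mathfrak{z}$ can be again chosen by the level condition. As $\xi = \sqrt{n}(\tilde\theta - \theta_0)/\sigma$ is
standard normal under $P_{\theta_0}$, one has to select $\mathfrak{z}$ such that $P(\xi > \sqrt{2\mathfrak{z}}) = \alpha$, leading
to

$$\phi = 1\bigl(\tilde\theta > \theta_0 + \sigma z_{1-\alpha}/\sqrt{n}\bigr),$$

where $z_{1-\alpha}$ denotes the $(1 - \alpha)$-quantile of the standard normal distribution.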
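
The one-sided rule is easy to state in code. The following is a minimal sketch assuming known $\sigma$; the function name `one_sided_lr_test` is an illustrative choice, not notation from the text.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_lr_test(y, theta0, sigma, alpha):
    """Reject H0: theta* = theta0 in favour of theta* > theta0
    when the sample mean exceeds theta0 + sigma * z_{1-alpha} / sqrt(n)."""
    n = len(y)
    theta_tilde = sum(y) / n
    threshold = theta0 + sigma * NormalDist().inv_cdf(1.0 - alpha) / sqrt(n)
    return int(theta_tilde > threshold)
```

In contrast to the two-sided test, a sample mean far below $\theta_0$ never leads to rejection here.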


6.4.3 Testing the Mean When the Variance Is Unknown 


This section discusses the Gaussian shift model $Y_i = \theta^* + \sigma^* \varepsilon_i$ with standard
normal errors $\varepsilon_i$ and unknown variance $\sigma^{*2}$. The log-likelihood function is still
given by (6.4), but now $\sigma^*$ is a part of the parameter vector.

6.4.3.1 Two-Sided Test Problem

Here, we are concerned with the two-sided testing problem $H_0 : \theta^* = \theta_0$ against
$H_1 : \theta^* \ne \theta_0$. Notice that the null hypothesis is composite, because it involves the
unknown variance $\sigma^{*2}$.

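Maximizing the log-likelihood $L(\theta, \sigma^2)$ under the null leads to the value
$L(\theta_0, \tilde\sigma_0^2)$ with

$$\tilde\sigma_0^2 := \operatorname*{argmax}_{\sigma^2} L(\theta_0, \sigma^2) = n^{-1} \sum_{i=1}^n (Y_i - \theta_0)^2.$$

As in Sect. 2.9.2 for the problem of variance estimation, it holds for any $\sigma$

$$L(\theta_0, \tilde\sigma_0^2) - L(\theta_0, \sigma^2) = nK(\tilde\sigma_0^2, \sigma^2).$$

At the same time, maximizing $L(\theta, \sigma^2)$ over the alternative is equivalent to the
global maximization leading to the value $L(\tilde\theta, \tilde\sigma^2)$ with

$$\tilde\theta = S/n, \qquad \tilde\sigma^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - \tilde\theta)^2.$$

The LR test statistic reads as

$$T = L(\tilde\theta, \tilde\sigma^2) - L(\theta_0, \tilde\sigma_0^2).$$

This expression can be decomposed in the following way:

$$T = L(\tilde\theta, \tilde\sigma^2) - L(\theta_0, \tilde\sigma^2) + L(\theta_0, \tilde\sigma^2) - L(\theta_0, \tilde\sigma_0^2) = \frac{n}{2\tilde\sigma^2} (\tilde\theta - \theta_0)^2 - nK(\tilde\sigma_0^2, \tilde\sigma^2).$$

In order to derive the CV $\mathfrak{z}$, notice that

$$\exp(T) = \frac{\exp(L(\tilde\theta, \tilde\sigma^2))}{\exp(L(\theta_0, \tilde\sigma_0^2))} = \Bigl(\frac{\tilde\sigma_0^2}{\tilde\sigma^2}\Bigr)^{n/2}.$$

Consequently, the LR test rejects for large values of

$$\frac{\tilde\sigma_0^2}{\tilde\sigma^2} = 1 + \frac{n(\tilde\theta - \theta_0)^2}{\sum_{i=1}^n (Y_i - \tilde\theta)^2}.$$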


In view of Theorems 6.4.1 and 6.4.2, $\mathfrak{z}$ is therefore a deterministic transformation
of the suitable quantile from Fisher's $F$-distribution with 1 and $n - 1$ degrees of
freedom or from Student's $t$-distribution with $n - 1$ degrees of freedom.

Exercise 6.4.6. Derive the explicit form of $\mathfrak{z}_\alpha$ for given significance level $\alpha$ in terms
of Fisher's $F$-distribution and in terms of Student's $t$-distribution.


Often one considers the case in which the variance is only estimated under the
alternative, that is, $\tilde\sigma$ is used in place of $\tilde\sigma_0$. This is quite natural because the null
can be viewed as a particular case of the alternative. This leads to the test statistic

$$T^* = L(\tilde\theta, \tilde\sigma^2) - L(\theta_0, \tilde\sigma^2) = \frac{n}{2\tilde\sigma^2} (\tilde\theta - \theta_0)^2.$$

Since $T^*$ is an isotone transformation of $T$, both tests are equivalent.
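
The isotone relation between the two statistics can be written out explicitly. The short derivation below is a worked sketch that uses only the definitions of $\tilde\sigma_0^2$, $\tilde\sigma^2$, $T$ and $T^*$ from this subsection.

```latex
\begin{align*}
\sum_{i=1}^n (Y_i - \theta_0)^2
  &= \sum_{i=1}^n (Y_i - \tilde\theta)^2 + n(\tilde\theta - \theta_0)^2
  && \text{(the cross term vanishes since } \tilde\theta = S/n\text{)} \\
\frac{\tilde\sigma_0^2}{\tilde\sigma^2}
  &= 1 + \frac{(\tilde\theta - \theta_0)^2}{\tilde\sigma^2}
   = 1 + \frac{2T^*}{n} \\
T &= \frac{n}{2} \log\frac{\tilde\sigma_0^2}{\tilde\sigma^2}
   = \frac{n}{2} \log\Bigl(1 + \frac{2T^*}{n}\Bigr)
\end{align*}
```

Since $x \mapsto (n/2)\log(1 + 2x/n)$ is strictly increasing on $[0, \infty)$, the events $\{T > \mathfrak{z}\}$ and $\{T^* > \mathfrak{z}^*\}$ coincide for $\mathfrak{z} = (n/2)\log(1 + 2\mathfrak{z}^*/n)$.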


6.4.3.2 One-Sided Test Problem 


In analogy to the considerations in Sect. 6.4.2, the LR test for the one-sided problem
$H_0 : \theta^* = \theta_0$ against $H_1 : \theta^* > \theta_0$ rejects if $\tilde\theta > \theta_0$ and $T$ exceeds a suitable
critical value $\mathfrak{z}$.

Exercise 6.4.7. Derive the explicit form of the LR test for the one-sided test problem.
Compute $\mathfrak{z}_\alpha$ for given significance level $\alpha$ in terms of Student's $t$-distribution.


6.4.4 Testing the Variance 


In this section, we consider the LR test for the hypothesis $H_0 : \sigma^2 = \sigma_0^2$ against
$H_1 : \sigma^2 > \sigma_0^2$ or $H_0 : \sigma^2 = \sigma_0^2$ against $H_1 : \sigma^2 \ne \sigma_0^2$, respectively. In this,
we assume that $\theta^*$ is known. The case of unknown $\theta^*$ can be treated similarly, cf.
Exercise 6.4.10 below. As discussed before, maximization of the likelihood under
the constraint of known mean yields

$$\tilde\sigma^2 := \operatorname*{argmax}_{\sigma^2} L(\theta^*, \sigma^2) = n^{-1} \sum_{i=1}^n (Y_i - \theta^*)^2.$$

The LR test for $H_0$ against $H_1$ rejects the null hypothesis if

$$T = L(\theta^*, \tilde\sigma^2) - L(\theta^*, \sigma_0^2)$$

exceeds a critical value $\mathfrak{z}$. For determining the rejection regions, notice that

$$2T = n(Q - \log(Q) - 1), \qquad Q := \tilde\sigma^2/\sigma_0^2. \qquad (6.6)$$


Exercise 6.4.8. (a) Verify representation (6.6).
(b) Show that $x \mapsto x - \log(x) - 1$ is a convex function on $(0, \infty)$ with minimum
value 0 at $x = 1$.

Combining Exercise 6.4.8 and Corollary 6.4.1, we conclude that the critical value
$\mathfrak{z}$ for the LR test for the one-sided test problem is a deterministic transformation
of a suitable quantile of the $\chi^2_n$-distribution. For the two-sided test problem, the
acceptance region of the LR test is an interval for $Q$ which is bounded by
deterministic transformations of lower and upper quantiles of the $\chi^2_n$-distribution.


Exercise 6.4.9. Derive the rejection regions of the LR tests for the one- and the 
two-sided test problems explicitly. 


Exercise 6.4.10. Derive one- and two-sided LR tests for $\sigma^2$ in the case of unknown
$\theta^*$. Hint: Use Theorem 6.4.2(b).


6.5 LR Tests: Further Examples 


We return to the models investigated in Sect. 2.9 and derive LR tests for the 
respective parameters. 


6.5.1 Bernoulli or Binomial Model 


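Assume that $(Y_1, \ldots, Y_n)$ are i.i.d. with $Y_i \sim \mathrm{Bernoulli}(\theta^*)$. Letting $S_n = \sum_{i=1}^n Y_i$,
the log-likelihood is given by

$$L(\theta) = \sum_{i=1}^n \{Y_i \log\theta + (1 - Y_i) \log(1 - \theta)\} = S_n \log\frac{\theta}{1 - \theta} + n\log(1 - \theta).$$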


6.5.1.1 Simple Null Versus Simple Alternative 


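The LR statistic for testing the simple null hypothesis $H_0 : \theta^* = \theta_0$ against the
simple alternative $H_1 : \theta^* = \theta_1$ reads as

$$T = L(\theta_1) - L(\theta_0) = S_n \log\frac{\theta_1(1 - \theta_0)}{\theta_0(1 - \theta_1)} + n\log\frac{1 - \theta_1}{1 - \theta_0}.$$

For a fixed significance level $\alpha$, the resulting LR test rejects if

$$\tilde\theta \gtrless \Bigl[\frac{\mathfrak{z}_\alpha}{n} - \log\frac{1 - \theta_1}{1 - \theta_0}\Bigr] \Big/ \log\frac{\theta_1(1 - \theta_0)}{\theta_0(1 - \theta_1)}, \qquad \theta_1 \gtrless \theta_0,$$

where $\tilde\theta = S_n/n$. For the determination of $\mathfrak{z}_\alpha$, it is convenient to notice that for any
pair $(\theta_0, \theta_1) \in (0, 1)^2$, the function

$$x \mapsto x \log\frac{\theta_1(1 - \theta_0)}{\theta_0(1 - \theta_1)} + \log\frac{1 - \theta_1}{1 - \theta_0}$$

is increasing (decreasing) in $x \in [0, 1]$ if $\theta_1 > \theta_0$ ($\theta_1 < \theta_0$). Hence, the LR statistic
$T$ is an isotone (antitone) transformation of $\tilde\theta$ if $\theta_1 > \theta_0$ ($\theta_1 < \theta_0$). Since $S_n = n\tilde\theta$
is under $H_0$ binomially distributed with parameters $n$ and $\theta_0$, the LR test $\phi$ is given
by

$$\phi = \begin{cases} 1\{S_n > F^{-1}_{\mathrm{Bin}(n,\theta_0)}(1 - \alpha)\}, & \theta_1 > \theta_0, \\ 1\{S_n < F^{-1}_{\mathrm{Bin}(n,\theta_0)}(\alpha)\}, & \theta_1 < \theta_0. \end{cases} \qquad (6.7)$$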


6.5.1.2 Composite Alternatives 
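Obviously, the LR test in (6.7) depends on the value of $\theta_1$ only via the sign of
$\theta_1 - \theta_0$. Therefore, the LR test for the one-sided test problem $H_0 : \theta^* = \theta_0$ against
$H_1 : \theta^* > \theta_0$ rejects if

$$S_n > F^{-1}_{\mathrm{Bin}(n,\theta_0)}(1 - \alpha)$$

and the LR test for $H_0$ against $H_1 : \theta^* < \theta_0$ rejects if

$$S_n < F^{-1}_{\mathrm{Bin}(n,\theta_0)}(\alpha).$$

The LR test for the two-sided test problem $H_0$ against $H_1 : \theta^* \ne \theta_0$ rejects if

$$S_n \notin \bigl[F^{-1}_{\mathrm{Bin}(n,\theta_0)}(\alpha/2),\; F^{-1}_{\mathrm{Bin}(n,\theta_0)}(1 - \alpha/2)\bigr].$$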
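
The binomial quantile tests above can be sketched with exact arithmetic on the binomial distribution function. This is an illustrative implementation only; it assumes the generalized inverse $F^{-1}(u) = \min\{k : F(k) \ge u\}$, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, theta):
    """F(k) = P(S_n <= k) for S_n ~ Bin(n, theta)."""
    if k < 0:
        return 0.0
    return sum(comb(n, j) * theta**j * (1 - theta) ** (n - j)
               for j in range(min(k, n) + 1))


def binom_quantile(u, n, theta):
    """Generalized inverse F^{-1}(u) = min{k : F(k) >= u}."""
    acc = 0.0
    for k in range(n + 1):
        acc += comb(n, k) * theta**k * (1 - theta) ** (n - k)
        if acc >= u:
            return k
    return n


def binomial_lr_test(s, n, theta0, alpha, alternative):
    """Decision of the quantile test for alternative in {'greater', 'less', 'two-sided'}."""
    if alternative == "greater":
        return int(s > binom_quantile(1 - alpha, n, theta0))
    if alternative == "less":
        return int(s < binom_quantile(alpha, n, theta0))
    if alternative == "two-sided":
        lo = binom_quantile(alpha / 2, n, theta0)
        hi = binom_quantile(1 - alpha / 2, n, theta0)
        return int(s < lo or s > hi)
    raise ValueError("unknown alternative")
```

Because $S_n$ is discrete, the attained level is in general strictly smaller than $\alpha$; for $n = 10$, $\theta_0 = 0.5$, $\alpha = 0.05$ the one-sided test rejects only for $S_n > 8$.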


6.5.2 Uniform Distribution on [0, 0] 


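Consider again the model from Sect. 2.9.4, i.e., $Y_1, \ldots, Y_n$ are i.i.d. with $Y_i \sim
\mathrm{UNI}[0, \theta^*]$, where the upper endpoint $\theta^*$ of the support is unknown. It holds that

$$Z(\theta) = \theta^{-n}\, 1(\max_i Y_i \le \theta)$$

and that the maximum of $Z(\theta)$ over $(0, \infty)$ is obtained for $\tilde\theta = \max_i Y_i$. Let us
consider the two-sided test problem $H_0 : \theta^* = \theta_0$ against $H_1 : \theta^* \ne \theta_0$ for some
given value $\theta_0 > 0$. We get that

$$\exp(T) = \frac{Z(\tilde\theta)}{Z(\theta_0)} = \begin{cases} (\theta_0/\tilde\theta)^n, & \max_i Y_i \le \theta_0, \\ \infty, & \max_i Y_i > \theta_0. \end{cases}$$

It follows that $\exp(T) > \mathfrak{z}$ if $\tilde\theta > \theta_0$ or $\tilde\theta < \theta_0 \mathfrak{z}^{-1/n}$. We compute the critical value
$\mathfrak{z}_\alpha$ for a level $\alpha$ test by noticing that

$$P_{\theta_0}\bigl(\{\tilde\theta > \theta_0\} \cup \{\tilde\theta < \theta_0 \mathfrak{z}^{-1/n}\}\bigr) = P_{\theta_0}(\tilde\theta < \theta_0 \mathfrak{z}^{-1/n}) = \bigl(F_{Y_1}(\theta_0 \mathfrak{z}^{-1/n})\bigr)^n = 1/\mathfrak{z}.$$

Thus, $\mathfrak{z}_\alpha = 1/\alpha$, and the LR test at level $\alpha$ for the two-sided problem $H_0$ against $H_1$ is given by

$$\phi = 1\{\tilde\theta > \theta_0\} + 1\{\tilde\theta < \theta_0 \alpha^{1/n}\}.$$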
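
A minimal sketch of this test follows; the function names are hypothetical. The second helper simply evaluates $(F_{Y_1}(\theta_0 \alpha^{1/n}))^n$ with $F_{Y_1}(y) = y/\theta_0$ on $[0, \theta_0]$, to confirm that the rejection probability under $\theta_0$ equals $\alpha$.

```python
def uniform_lr_test(y, theta0, alpha):
    """Reject H0: theta* = theta0 if max(y) > theta0 or max(y) < theta0 * alpha^(1/n)."""
    n = len(y)
    theta_tilde = max(y)  # MLE of the upper endpoint
    return int(theta_tilde > theta0 or theta_tilde < theta0 * alpha ** (1.0 / n))


def uniform_level(theta0, alpha, n):
    """P_{theta0}(max_i Y_i < theta0 * alpha^(1/n)) for Y_i i.i.d. uniform on [0, theta0]."""
    cutoff = theta0 * alpha ** (1.0 / n)
    return (cutoff / theta0) ** n
```

Note that the lower cutoff $\theta_0 \alpha^{1/n}$ approaches $\theta_0$ as $n$ grows, reflecting the fast $1/n$ rate of the maximum in this model.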


6.5.3 Exponential Model 


We return to the model considered in Sect. 2.9.7 and assume that $Y_1, \ldots, Y_n$ are
i.i.d. exponential random variables with parameter $\theta^* > 0$. The corresponding
log-likelihood can be written as

$$L(\theta) = -n\log\theta - \sum_{i=1}^n Y_i/\theta = -S/\theta - n\log\theta,$$

where $S = Y_1 + \ldots + Y_n$.

In order to derive the LR test for the simple hypothesis $H_0 : \theta^* = \theta_0$ against the
simple alternative $H_1 : \theta^* = \theta_1$, notice that the LR statistic $T$ is given by

$$T = L(\theta_1) - L(\theta_0) = S\Bigl(\frac{\theta_1 - \theta_0}{\theta_0\theta_1}\Bigr) + n\log\frac{\theta_0}{\theta_1}.$$

Since the function

$$x \mapsto x\,\frac{\theta_1 - \theta_0}{\theta_0\theta_1} + n\log(\theta_0/\theta_1)$$

is increasing (decreasing) in $x > 0$ whenever $\theta_1 > \theta_0$ ($\theta_1 < \theta_0$), the LR test
rejects for large values of $S$ in the case that $\theta_1 > \theta_0$ and for small values of $S$
if $\theta_1 < \theta_0$. Due to the facts that the exponential distribution with parameter $\theta_0$ is
identical to $\mathrm{Gamma}(\theta_0, 1)$ and the family of gamma distributions is convolution-stable
with respect to its second parameter whenever the first parameter is fixed, we
obtain that $S$ is under $\theta_0$ distributed as $\mathrm{Gamma}(\theta_0, n)$. This implies that the LR test
$\phi$ for $H_0$ against $H_1$ at significance level $\alpha$ is given by

$$\phi = \begin{cases} 1\{S > F^{-1}_{\mathrm{Gamma}(\theta_0,n)}(1 - \alpha)\}, & \theta_1 > \theta_0, \\ 1\{S < F^{-1}_{\mathrm{Gamma}(\theta_0,n)}(\alpha)\}, & \theta_1 < \theta_0. \end{cases}$$


Moreover, composite alternatives can be tested in analogy to the considerations in 
Sect. 6.5.1. 


6.5.4 Poisson Model 


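Let $Y_1, \ldots, Y_n$ be i.i.d. Poisson random variables satisfying $P(Y_i = m) =
|\theta^*|^m e^{-\theta^*}/m!$ for $m = 0, 1, 2, \ldots$. According to Sect. 2.9.8, we have that

$$L(\theta) = S\log\theta - n\theta + R,$$

$$L(\theta_1) - L(\theta_0) = S\log\frac{\theta_1}{\theta_0} + n(\theta_0 - \theta_1),$$

where the remainder term $R$ does not depend on $\theta$. In order to derive the LR test for
the simple hypothesis $H_0 : \theta^* = \theta_0$ against the simple alternative $H_1 : \theta^* = \theta_1$,
we again check easily that $x \mapsto x\log(\theta_1/\theta_0) + n(\theta_0 - \theta_1)$ is increasing (decreasing)
in $x > 0$ if $\theta_1 > \theta_0$ ($\theta_1 < \theta_0$). Convolution stability of the family of Poisson
distributions entails that the LR test $\phi$ for $H_0$ against $H_1$ at significance level $\alpha$ is
given by

$$\phi = \begin{cases} 1\{S > F^{-1}_{\mathrm{Poisson}(n\theta_0)}(1 - \alpha)\}, & \theta_1 > \theta_0, \\ 1\{S < F^{-1}_{\mathrm{Poisson}(n\theta_0)}(\alpha)\}, & \theta_1 < \theta_0. \end{cases}$$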


Moreover, composite alternatives can be tested in analogy to Sect. 6.5.1. 


6.6 Testing Problem for a Univariate Exponential Family 


Let $(P_\theta, \theta \in \Theta \subseteq \mathbb{R}^1)$ be a univariate exponential family. The choice of
parametrization is unimportant, any parametrization can be taken. To be specific, we
assume the natural parametrization that simplifies the expression for the maximum
likelihood estimate.

We assume that the two functions $C(\theta)$ and $B(\theta)$ of $\theta$ are fixed, with which the
log-density of $P_\theta$ can be written in the form:

$$\ell(y, \theta) := \log p(y, \theta) = yC(\theta) - B(\theta) - \ell(y)$$

for some other function $\ell(y)$. The function $C(\theta)$ is monotonic in $\theta$, and $C(\theta)$ and
$B(\theta)$ are related (for the case of an EFn) by the identity $B'(\theta) = \theta C'(\theta)$, see
Sect. 2.11.

Let now $Y = (Y_1, \ldots, Y_n)^\top$ be an i.i.d. sample from $P_{\theta^*}$ for $\theta^* \in \Theta$. The task is
to test a simple hypothesis $\theta^* = \theta_0$ against an alternative $\theta^* \in \Theta_1$ for some subset
$\Theta_1$ that does not contain $\theta_0$.


6.6.1 Two-Sided Alternative 


We start with the case of a simple hypothesis $H_0 : \theta^* = \theta_0$ against a full two-sided
alternative $H_1 : \theta^* \ne \theta_0$. The likelihood ratio approach suggests to compare
the likelihood at $\theta_0$ with the maximum of the likelihood over the alternative, which
effectively means the maximum over the whole parameter set $\Theta$. In the case of a
univariate exponential family, this maximum is computed in Sect. 2.11. For

$$L(\theta, \theta_0) := L(\theta) - L(\theta_0) = S[C(\theta) - C(\theta_0)] - n[B(\theta) - B(\theta_0)]$$

with $S = Y_1 + \ldots + Y_n$, it holds

$$T := \sup_\theta L(\theta, \theta_0) = nK(\tilde\theta, \theta_0),$$

where $K(\theta, \theta') = E_\theta \ell(\theta, \theta')$ is the Kullback-Leibler divergence between the
measures $P_\theta$ and $P_{\theta'}$. For an EFn, the MLE $\tilde\theta$ is the empirical mean of the
observations $Y_i$, $\tilde\theta = S/n$, and the KL divergence $K(\tilde\theta, \theta_0)$ is of the form

$$K(\tilde\theta, \theta_0) = \tilde\theta[C(\tilde\theta) - C(\theta_0)] - [B(\tilde\theta) - B(\theta_0)].$$

Therefore, the test statistic $T$ is a function of the empirical mean $\tilde\theta = S/n$:

$$T = nK(\tilde\theta, \theta_0) = n\tilde\theta[C(\tilde\theta) - C(\theta_0)] - n[B(\tilde\theta) - B(\theta_0)]. \qquad (6.8)$$

The LR test rejects $H_0$ if the test statistic $T$ exceeds a critical value $\mathfrak{z}$. Given $\alpha \in
(0, 1)$, a proper CV $\mathfrak{z}_\alpha$ can be specified by the level condition

$$P_{\theta_0}(T > \mathfrak{z}_\alpha) = \alpha.$$


In view of (6.8), the LR test rejects the null if the "distance" $K(\tilde\theta, \theta_0)$ between
the estimate $\tilde\theta$ and the null $\theta_0$ is significantly larger than zero. In the case of an
exponential family, one can simplify the test just by considering the estimate $\tilde\theta$ as
test statistic. We use the following technical result for the KL divergence $K(\theta, \theta_0)$:

Lemma 6.6.1. Let $(P_\theta)$ be an EFn. Then for every $\mathfrak{z}$ there are two positive values
$t^-(\mathfrak{z})$ and $t^+(\mathfrak{z})$ such that

$$\{\theta : K(\theta, \theta_0) \le \mathfrak{z}\} = \{\theta : \theta_0 - t^-(\mathfrak{z}) \le \theta \le \theta_0 + t^+(\mathfrak{z})\}. \qquad (6.9)$$

In other words, the conditions $K(\theta, \theta_0) \le \mathfrak{z}$ and $\theta_0 - t^-(\mathfrak{z}) \le \theta \le \theta_0 + t^+(\mathfrak{z})$ are
equivalent.

Proof. The function $K(\theta, \theta_0)$ of the first argument $\theta$ fulfills

$$\frac{\partial K(\theta, \theta_0)}{\partial\theta} = C(\theta) - C(\theta_0), \qquad \frac{\partial^2 K(\theta, \theta_0)}{\partial\theta^2} = C'(\theta) > 0.$$

Therefore, it is convex in $\theta$ with minimum at $\theta_0$, and it can cross the level $\mathfrak{z}$ only
once from the left of $\theta_0$ and once from the right. This yields that for any $\mathfrak{z} > 0$, there
are two positive values $t^-(\mathfrak{z})$ and $t^+(\mathfrak{z})$ such that (6.9) holds. Note that one or even
both of these values can be infinite.


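Due to the result of this lemma, the LR test can be rewritten as

$$\phi = 1(T > \mathfrak{z}) = 1 - 1(T \le \mathfrak{z}) = 1 - 1\bigl(-t^-(\mathfrak{z}) \le \tilde\theta - \theta_0 \le t^+(\mathfrak{z})\bigr) = 1\bigl(\tilde\theta > \theta_0 + t^+(\mathfrak{z})\bigr) + 1\bigl(\tilde\theta < \theta_0 - t^-(\mathfrak{z})\bigr),$$

that is, the test rejects the null hypothesis if the estimate $\tilde\theta$ deviates significantly
from $\theta_0$.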
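
To make Lemma 6.6.1 concrete, the two values $t^-$ and $t^+$ can be found numerically by bisection, using the convexity of $K(\cdot, \theta_0)$. The sketch below assumes, purely as an illustration, the Poisson family with mean $\theta$, for which $C(\theta) = \log\theta$ and $B(\theta) = \theta$; the function names are hypothetical. Since $T = nK(\tilde\theta, \theta_0)$, the level passed to these helpers corresponds to $\mathfrak{z}/n$ when applied to the test statistic.

```python
from math import log


def kl_poisson(theta, theta0):
    """K(theta, theta0) = theta [log theta - log theta0] - [theta - theta0]."""
    if theta == 0.0:
        return theta0  # limit of the expression as theta -> 0
    return theta * log(theta / theta0) - (theta - theta0)


def t_plus(theta0, z):
    """Distance from theta0 to the right crossing of the level z."""
    hi = theta0 + 1.0
    while kl_poisson(hi, theta0) <= z:  # expand until the level is exceeded
        hi = theta0 + 2.0 * (hi - theta0)
    lo = theta0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if kl_poisson(mid, theta0) <= z:
            lo = mid
        else:
            hi = mid
    return lo - theta0


def t_minus(theta0, z):
    """Distance from theta0 to the left crossing; theta0 if the level is never crossed."""
    if kl_poisson(0.0, theta0) <= z:
        return theta0  # whole interval (0, theta0] lies in the level set
    lo, hi = 0.0, theta0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if kl_poisson(mid, theta0) <= z:
            hi = mid
        else:
            lo = mid
    return theta0 - hi
```

In this example the level set is not symmetric around $\theta_0$: because $C'(\theta) = 1/\theta$ decreases in $\theta$, the divergence grows more slowly to the right, so $t^+ > t^-$.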


6.6.2 One-Sided Alternative 


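Now we consider the problem of testing the same null $H_0 : \theta^* = \theta_0$ against the
one-sided alternative $H_1 : \theta^* > \theta_0$. Of course, the other one-sided alternative $H_1 :
\theta^* < \theta_0$ can be considered analogously.

The LR test requires computing the maximum of the log-likelihood over the
alternative set $\{\theta : \theta > \theta_0\}$. This can be done as in the Gaussian shift model.
If $\tilde\theta > \theta_0$, then this maximum coincides with the global maximum over all $\theta$.
Otherwise, it is attained at $\theta = \theta_0$.

Lemma 6.6.2. Let $(P_\theta)$ be an EFn. Then

$$\sup_{\theta > \theta_0} L(\theta, \theta_0) = \begin{cases} nK(\tilde\theta, \theta_0) & \text{if } \tilde\theta \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases}$$

Proof. It is only necessary to consider the case $\tilde\theta < \theta_0$. The difference $L(\theta) -
L(\theta_0)$ can be represented as $S[C(\theta) - C(\theta_0)] - n[B(\theta) - B(\theta_0)]$. Next, usage of the
identities $\tilde\theta = S/n$ and $B'(\theta) = \theta C'(\theta)$ yields

$$\frac{\partial L(\theta, \theta_0)}{\partial\theta} = -\frac{\partial L(\tilde\theta, \theta)}{\partial\theta} = -n\frac{\partial K(\tilde\theta, \theta)}{\partial\theta} = n(\tilde\theta - \theta)C'(\theta) < 0$$

for any $\theta > \tilde\theta$. This implies that $L(\theta) - L(\theta_0)$ becomes negative as $\theta$ grows beyond
$\theta_0$ and thus, $L(\theta, \theta_0)$ has its supremum over $\{\theta : \theta > \theta_0\}$ at $\theta = \theta_0$, yielding the
assertion.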


This fact implies the following representation of the LR test in the case of a 
one-sided alternative. 


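Theorem 6.6.1. Let $(P_\theta)$ be an EFn. Then the $\alpha$-level LR test for the null $H_0 :
\theta^* = \theta_0$ against the one-sided alternative $H_1 : \theta^* > \theta_0$ is

$$\phi = 1(\tilde\theta > \theta_0 + t_\alpha), \qquad (6.10)$$

where $t_\alpha$ is selected to ensure $P_{\theta_0}(\tilde\theta > \theta_0 + t_\alpha) = \alpha$.

Proof. Let $T$ be the LR test statistic. Due to Lemmas 6.6.2 and 6.6.1, the event
$\{T > \mathfrak{z}\}$ can be rewritten as $\{\tilde\theta > \theta_0 + t(\mathfrak{z})\}$ for some constant $t(\mathfrak{z})$. It remains to
select a proper value $t(\mathfrak{z}) = t_\alpha$ to fulfill the level condition.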


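This result can be extended naturally to the case of a composite null hypothesis
$H_0 : \theta^* \le \theta_0$.

Theorem 6.6.2. Let $(P_\theta)$ be an EFn. Then the $\alpha$-level LR test for the composite
null $H_0 : \theta^* \le \theta_0$ against the one-sided alternative $H_1 : \theta^* > \theta_0$ is

$$\phi^*_\alpha = 1(\tilde\theta > \theta_0 + t_\alpha), \qquad (6.11)$$

where $t_\alpha$ is selected to ensure $P_{\theta_0}(\tilde\theta > \theta_0 + t_\alpha) = \alpha$.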


Proof. The same arguments as in the proof of Theorem 6.6.1 lead to exactly the
same LR test statistic $T$ and thus to the test of the form (6.10). In particular, the
estimate $\tilde\theta$ should significantly deviate from the null set. It remains to check that the
level condition for the edge point $\theta_0$ ensures the level for all $\theta < \theta_0$. This follows
from the next monotonicity property.

Lemma 6.6.3. Let $(P_\theta)$ be an EFn. Then for any $t \ge 0$

$$P_\theta(\tilde\theta > \theta_0 + t) \le P_{\theta_0}(\tilde\theta > \theta_0 + t), \qquad \forall\theta < \theta_0.$$

Proof. Let $\theta < \theta_0$. We apply

$$P_\theta(\tilde\theta > \theta_0 + t) = E_{\theta_0}\bigl[\exp\{L(\theta, \theta_0)\}\, 1(\tilde\theta > \theta_0 + t)\bigr].$$

Now the monotonicity of $L(\theta, \theta_0)$ w.r.t. $\theta$ (see the proof of Lemma 6.6.2) implies
$L(\theta, \theta_0) < 0$ on the set $\{\theta < \theta_0 < \tilde\theta\}$. This yields the result.

Therefore, if the level is controlled under $P_{\theta_0}$, it is also controlled for all other points
in the null set.


A very nice feature of the LR test is that it can be universally represented in terms
of $\tilde\theta$ independently of the form of the alternative set. In particular, for the case of a
one-sided alternative, this test just compares the estimate $\tilde\theta$ with the value $\theta_0 + t_\alpha$.
Moreover, the value $t_\alpha$ only depends on the distribution of $\tilde\theta$ under $P_{\theta_0}$ via the level
condition. This and the monotonicity of the error probability from Lemma 6.6.3
allow us to state the nice optimality property of this test: $\phi^*_\alpha$ is uniformly most
powerful in the sense of Definition 6.1.1, that is, it maximizes the test power under
the level constraint.

Theorem 6.6.3. Let $(P_\theta)$ be an EFn, and let $\phi^*_\alpha$ be the test from (6.11) for testing
$H_0 : \theta^* \le \theta_0$ against $H_1 : \theta^* > \theta_0$. For any (randomized) test $\phi$ satisfying $E_{\theta_0}\phi \le \alpha$
and any $\theta > \theta_0$, it holds

$$E_\theta\phi \le P_\theta(\phi^*_\alpha = 1).$$

In fact, this theorem repeats the Neyman-Pearson result of Theorem 6.2.2,
because the test $\phi^*_\alpha$ is at the same time the LR $\alpha$-level test of the simple hypothesis
$\theta^* = \theta_0$ against $\theta^* = \theta_1$, for any value $\theta_1 > \theta_0$.


6.6.3 Interval Hypothesis 


In some applications, the null hypothesis is naturally formulated in the form that
the parameter $\theta^*$ belongs to a given interval $[\theta_0, \theta_1]$. The alternative $H_1 : \theta^* \in
\Theta \setminus [\theta_0, \theta_1]$ is the complement of this interval. The likelihood ratio test is based on
the test statistic $T$ from (6.3) which compares the maximum of the log-likelihood
$L(\theta)$ under the null $[\theta_0, \theta_1]$ with the maximum over the alternative set. The special
structure of the log-likelihood in the case of an EFn permits representing this test
statistic in terms of the estimate $\tilde\theta$: the hypothesis is rejected if the estimate $\tilde\theta$
significantly deviates from the interval $[\theta_0, \theta_1]$.


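Theorem 6.6.4. Let $(P_\theta)$ be an EFn. Then the $\alpha$-level LR test for the null $H_0 : \theta \in
[\theta_0, \theta_1]$ against the alternative $H_1 : \theta \notin [\theta_0, \theta_1]$ can be written as

$$\phi = 1(\tilde\theta > \theta_1 + t_\alpha^+) + 1(\tilde\theta < \theta_0 - t_\alpha^-), \qquad (6.12)$$

where the non-negative constants $t_\alpha^+$ and $t_\alpha^-$ are selected to ensure $P_{\theta_0}(\tilde\theta < \theta_0 -
t_\alpha^-) + P_{\theta_1}(\tilde\theta > \theta_1 + t_\alpha^+) = \alpha$.

Exercise 6.6.1. Prove the result of Theorem 6.6.4.
Hint: Consider three cases: $\tilde\theta \in [\theta_0, \theta_1]$, $\tilde\theta > \theta_1$, and $\tilde\theta < \theta_0$. For every case, apply
the monotonicity of $L(\tilde\theta, \theta)$ in $\theta$.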


One can consider the alternative of the interval hypothesis as a combination of
two one-sided alternatives. The LR test $\phi$ from (6.12) involves only one critical
value $\mathfrak{z}$ and the parameters $t_\alpha^-$ and $t_\alpha^+$ are related via the structure of this test: they are
obtained by transforming the inequality $T > \mathfrak{z}_\alpha$ into $\tilde\theta > \theta_1 + t_\alpha^+$ and $\tilde\theta < \theta_0 - t_\alpha^-$.
However, one can just apply two one-sided tests separately: one for the alternative
$H_1^- : \theta^* < \theta_0$ and one for $H_1^+ : \theta^* > \theta_1$. This leads to the two tests

$$\phi^- := 1(\tilde\theta < \theta_0 - t^-), \qquad \phi^+ := 1(\tilde\theta > \theta_1 + t^+). \qquad (6.13)$$

The values $t^-, t^+$ can be chosen by the so-called Bonferroni rule: just perform each
of the two tests at level $\alpha/2$.

Exercise 6.6.2. For fixed $\alpha \in (0, 1)$, let the values $t_\alpha^-, t_\alpha^+$ be selected to ensure

$$P_{\theta_0}(\tilde\theta < \theta_0 - t_\alpha^-) = \alpha/2, \qquad P_{\theta_1}(\tilde\theta > \theta_1 + t_\alpha^+) = \alpha/2.$$

Then for any $\theta \in [\theta_0, \theta_1]$, the test $\phi$ given by $\phi = \phi^- + \phi^+$ (cf. (6.13)) fulfills

$$P_\theta(\phi = 1) \le \alpha.$$

Hint: Use the monotonicity from Lemma 6.6.3.


6.7 Historical Remarks and Further Reading 


The theory of optimal tests goes back to Jerzy Neyman (1894-1981) and Egon 
Sharpe Pearson (1895-1980). Some early considerations with respect to likelihood 
ratio tests can be found in Neyman and Pearson (1928). Neyman and Pearson 
(1933)’s fundamental lemma is core to the derivation of most powerful (likelihood 
ratio) tests. Fisher (1934) showed that uniformly best tests (over the parameter 
subspace defining a composite alternative) only exist in one-parametric exponential 
families. More details about the origins of the theory of optimal tests can be found 
in the book by Lehmann (2011). 

Important contributions to the theory of tests for parameters of a normal 
distribution have moreover been made by William Sealy Gosset (1876-1937; under 
the pen name “Student,” see Student (1908)) and Ernst Abbe (1840-1905; cf. 
Kendall (1971) for an account of Abbe’s work). 

The $\chi^2$-test for goodness-of-fit is due to Pearson (1900). The phenomenon that
the limiting distribution of twice the log-likelihood ratio statistic in nested models is,
under regularity conditions, a chi-square distribution was discovered by Wilks
(1938).

An excellent textbook reference for the theory of testing statistical hypotheses is 
Lehmann and Romano (2005). 


Chapter 7 
Testing in Linear Models 


This chapter discusses testing problems for linear Gaussian models given by the
equation

$$Y = f^* + \varepsilon \qquad (7.1)$$

with the vector of observations $Y$, response vector $f^*$, and vector of errors $\varepsilon$ in
$\mathbb{R}^n$. The linear parametric assumption (linear PA) means that

$$Y = \Psi^\top\theta^* + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma), \qquad (7.2)$$

where $\Psi$ is the $p \times n$ design matrix. By $\theta$ we denote the $p$-dimensional target
parameter vector, $\theta \in \Theta \subseteq \mathbb{R}^p$. Usually we assume that the parameter set
coincides with the whole space $\mathbb{R}^p$, i.e. $\Theta = \mathbb{R}^p$. The most general assumption
about the vector of errors $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ is $\operatorname{Var}(\varepsilon) = \Sigma$, which permits
inhomogeneous and correlated errors. However, for most results we assume i.i.d.
errors $\varepsilon_i \sim N(0, \sigma^2)$. The variance $\sigma^2$ could be unknown as well. As in previous
chapters, $\theta^*$ denotes the true value of the parameter vector (assuming that the model
(7.2) is correct).


7.1 Likelihood Ratio Test for a Simple Null 


This section discusses the problem of testing a simple hypothesis $H_0 : \theta^* = \theta_0$ for
a given vector $\theta_0$. A natural "non-informative" alternative is $H_1 : \theta^* \ne \theta_0$.


V. Spokoiny and T. Dickhaus, Basics of Modern Mathematical Statistics, 223 
Springer Texts in Statistics, DOI 10.1007/978-3-642-39909-1__7, 
© Springer-Verlag Berlin Heidelberg 2015

7.1.1 General Errors

We start from the case of general errors with known covariance matrix $\Sigma$. The
results obtained for the estimation problem in Chap. 4 will be heavily used in our
study. In particular, the MLE $\tilde\theta$ of $\theta^*$ is

$$\tilde\theta = (\Psi\Sigma^{-1}\Psi^\top)^{-1}\Psi\Sigma^{-1}Y$$

and the corresponding maximum likelihood is

$$L(\tilde\theta, \theta_0) = \frac{1}{2}(\tilde\theta - \theta_0)^\top B(\tilde\theta - \theta_0)$$

with a $p \times p$-matrix $B$ given by

$$B = \Psi\Sigma^{-1}\Psi^\top.$$

This immediately leads to the following representation for the likelihood ratio (LR)
test in this setup:

$$T := \sup_\theta L(\theta, \theta_0) = \frac{1}{2}(\tilde\theta - \theta_0)^\top B(\tilde\theta - \theta_0). \qquad (7.3)$$

Moreover, Wilks' phenomenon claims that under $P_{\theta_0}$, the test statistic $T$ has a
fixed distribution: namely, $2T$ is $\chi^2_p$-distributed (chi-squared with $p$ degrees of
freedom).

Theorem 7.1.1. Consider the model (7.2) with $\varepsilon \sim N(0, \Sigma)$ for a known matrix
$\Sigma$. Then the LR test statistic $T$ is given by (7.3). Moreover, if $\mathfrak{z}_\alpha$ fulfills $P(\zeta_p >
2\mathfrak{z}_\alpha) = \alpha$ with $\zeta_p \sim \chi^2_p$, then the LR test $\phi$ with

$$\phi = 1(T > \mathfrak{z}_\alpha) \qquad (7.4)$$

is of exact level $\alpha$:

$$P_{\theta_0}(\phi = 1) = \alpha.$$

This result follows directly from Theorems 4.6.1 and 4.6.2. We see again the
important pivotal property of the test: the critical value $\mathfrak{z}_\alpha$ only depends on the
dimension of the parameter space $\Theta$. It does not depend on the design matrix $\Psi$,
error covariance $\Sigma$, and the null value $\theta_0$.

7.1.2 I.i.d. Errors, Known Variance

We now specify this result for the case of i.i.d. errors. We also focus on the residuals

$$\hat\varepsilon := Y - \Psi^\top\tilde\theta = Y - \tilde f,$$

where $\tilde f = \Psi^\top\tilde\theta$ is the estimated response of the true regression function
$f^* = \Psi^\top\theta^*$.

We start with some geometric properties of the residuals $\hat\varepsilon$ and the test statistic
$T$ from (7.3).
We start with some geometric properties of the residuals € and the test statistic 
T from (7.3). 


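Theorem 7.1.2. Consider the model (7.1). Let $T$ be the LR test statistic built under
the assumptions $f^* = \Psi^\top\theta^*$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$ with a known value $\sigma^2$. Then
$T$ is given by

$$T = \frac{1}{2\sigma^2}\|\Psi^\top(\tilde\theta - \theta_0)\|^2 = \frac{1}{2\sigma^2}\|\tilde f - f_0\|^2. \qquad (7.5)$$

Moreover, the following decompositions for the vector of observations $Y$ and for
the errors $\varepsilon = Y - f_0$ hold:

$$Y - f_0 = (\tilde f - f_0) + \hat\varepsilon, \qquad (7.6)$$
$$\|Y - f_0\|^2 = \|\tilde f - f_0\|^2 + \|\hat\varepsilon\|^2, \qquad (7.7)$$

where $\tilde f - f_0$ is the estimation error and $\hat\varepsilon = Y - \tilde f$ is the vector of residuals.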


Proof. The key step of the proof is the representation of the estimated response
$\tilde f$ under the model assumption $Y = f^* + \varepsilon$ as a projection of the data on the
$p$-dimensional linear subspace $\mathcal{L}$ in $\mathbb{R}^n$ spanned by the rows of the matrix $\Psi$:

$$\tilde f = \Pi Y = \Pi(f^* + \varepsilon),$$

where $\Pi = \Psi^\top(\Psi\Psi^\top)^{-1}\Psi$ is the projector onto $\mathcal{L}$; see Sect. 4.3. Note that
this decomposition is valid for the general linear model; the parametric form of
the response $f^*$ and the noise normality are not required. The identity $\Psi^\top(\tilde\theta -
\theta_0) = \tilde f - f_0$ follows directly from the definition, implying the representation
(7.5) for the test statistic $T$. The identity (7.6) follows from the definition. Next,
$\Pi f_0 = f_0$ and thus $\tilde f - f_0 = \Pi(Y - f_0)$. Similarly,

$$\hat\varepsilon = Y - \tilde f = (I_n - \Pi)Y.$$

As $\Pi$ and $I_n - \Pi$ are orthogonal projectors, it follows

$$\|Y - f_0\|^2 = \|(I_n - \Pi)Y + \Pi(Y - f_0)\|^2 = \|(I_n - \Pi)Y\|^2 + \|\Pi(Y - f_0)\|^2$$

and the decomposition (7.7) follows.

The decomposition (7.6), although straightforward, is very important for
understanding the structure of the residuals under the null and under the alternative.
Under the null $H_0$, the response $f^*$ is assumed to be known and coincides with
$f_0$, so the residuals $\hat\varepsilon$ coincide with the errors $\varepsilon$. The sum of squared residuals is
usually abbreviated as RSS:


$$\mathrm{RSS}_0 := \|Y - f_0\|^2.$$

Under the alternative, the response is unknown and is estimated by $\tilde f$. The residuals
are $\hat\varepsilon = Y - \tilde f$ resulting in the RSS

$$\mathrm{RSS} := \|Y - \tilde f\|^2.$$

The decomposition (7.7) can be rewritten as

$$\mathrm{RSS}_0 = \mathrm{RSS} + \|\tilde f - f_0\|^2. \qquad (7.8)$$

We see that the RSS under the null and the alternative can be essentially different
only if the estimate $\tilde f$ significantly deviates from the null assumption $f^* = f_0$.
The test statistic $T$ from (7.3) can be written as

$$T = \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{2\sigma^2}.$$
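
The decomposition (7.8) can be checked on a toy example. The sketch below is an illustration under simplifying assumptions not made in the text: a single regressor ($p = 1$), so that the design matrix reduces to one row `psi`, and plain Python lists instead of a linear algebra library. The function name `rss_decomposition` is hypothetical.

```python
def rss_decomposition(psi, y, theta0, sigma):
    """Return (RSS_0, RSS, ||f_tilde - f_0||^2, T) for the model y_i = psi_i * theta + eps_i."""
    norm2 = sum(a * a for a in psi)
    theta_tilde = sum(a * b for a, b in zip(psi, y)) / norm2  # least squares estimate
    f_tilde = [a * theta_tilde for a in psi]                  # estimated response
    f0 = [a * theta0 for a in psi]                            # response under the null
    rss0 = sum((b - c) ** 2 for b, c in zip(y, f0))
    rss = sum((b - c) ** 2 for b, c in zip(y, f_tilde))
    fit = sum((c - d) ** 2 for c, d in zip(f_tilde, f0))
    t = (rss0 - rss) / (2.0 * sigma**2)
    return rss0, rss, fit, t
```

Up to rounding, the first returned value equals the sum of the second and third, and the last value equals the third divided by $2\sigma^2$, in line with (7.5) and (7.8).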


For the proofs in the remainder of this chapter, the following results concerning 
the distribution of quadratic forms of Gaussians are helpful. 


Theorem 7.1.3. 1. Let $X \sim N_n(\mu, \Sigma)$, where $\Sigma$ is symmetric and positive
definite. Then, $(X - \mu)^\top\Sigma^{-1}(X - \mu) \sim \chi^2_n$.

2. Let $X \sim N_n(0, I_n)$, $R$ a symmetric, idempotent $(n \times n)$-matrix with $\operatorname{rank}(R) =
r$ and $B$ a $(p \times n)$-matrix with $p \le n$. Then it holds

(a) $X^\top RX \sim \chi^2_r$.
(b) From $BR = 0$, it follows that $X^\top RX$ is independent of $BX$.

3. Let $X \sim N_n(0, I_n)$ and $R$, $S$ symmetric and idempotent $(n \times n)$-matrices with
$\operatorname{rank}(R) = r$, $\operatorname{rank}(S) = s$ and $RS = 0$. Then it holds

(a) $X^\top RX$ and $X^\top SX$ are independent.
(b) $\dfrac{X^\top RX/r}{X^\top SX/s} \sim F_{r,s}$.
Proof. For proving the first assertion, let $\Sigma^{1/2}$ denote the uniquely defined,
symmetric, and positive definite matrix fulfilling $\Sigma^{1/2} \cdot \Sigma^{1/2} = \Sigma$, with inverse
matrix $\Sigma^{-1/2}$. Then, $Z := \Sigma^{-1/2}(X - \mu) \sim N_n(0, I_n)$. The assertion $Z^\top Z \sim \chi^2_n$
follows from the definition of the chi-squared distribution.

For the second part, notice that, due to symmetry and idempotence of $R$, there
exists an orthogonal matrix $P$ with $R = PD_rP^\top$, where $D_r = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}$.
Orthogonality of $P$ implies $W := P^\top X \sim N_n(0, I_n)$. We conclude

$$X^\top RX = X^\top R^2X = (RX)^\top(RX) = (PD_rW)^\top(PD_rW) = W^\top D_rP^\top PD_rW = W^\top D_rW = \sum_{i=1}^r W_i^2 \sim \chi^2_r.$$

Furthermore, we have $Z_1 := BX \sim N_p(0, BB^\top)$, $Z_2 := RX \sim N_n(0, R)$, and
$\operatorname{Cov}(Z_1, Z_2) = \operatorname{Cov}(BX, RX) = B\operatorname{Cov}(X)R^\top = BR = 0$. As uncorrelatedness
implies independence under Gaussianity, statement 2.(b) follows.

To prove the third part, let $Z_1 := SX \sim N_n(0, S)$ and $Z_2 := RX \sim N_n(0, R)$ and
notice that $\operatorname{Cov}(Z_1, Z_2) = S\operatorname{Cov}(X)R = SR = S^\top R^\top = (RS)^\top = 0$. Assertion
(a) is then implied by the identities $X^\top SX = Z_1^\top Z_1$ and $X^\top RX = Z_2^\top Z_2$.
Assertion (b) immediately follows from 3(a) and 2(a).


Exercise 7.1.1. Prove Theorem 6.4.2. 
Hint: Use Theorem 7.1.3. 


Now we show that if the model assumptions are correct, the LR test based on T 
from (7.3) has the exact level a and is pivotal. 


Theorem 7.1.4. Consider the model (7.1) with $\varepsilon \sim N(0, \sigma^2 I_n)$ for a known value
$\sigma^2$, implying that the $\varepsilon_i$ are i.i.d. normal. The LR test $\phi$ from (7.4) is of exact level
$\alpha$. Moreover, $\tilde f - f_0$ and $\hat\varepsilon$ are under $P_{\theta_0}$ zero-mean independent Gaussian
vectors satisfying

$$2T = \sigma^{-2}\|\tilde f - f_0\|^2 \sim \chi^2_p, \qquad \sigma^{-2}\|\hat\varepsilon\|^2 \sim \chi^2_{n-p}. \qquad (7.9)$$

Proof. The null assumption $f^* = f_0$ together with $\Pi f_0 = f_0$ implies now the
following decomposition:

$$\tilde f - f_0 = \Pi\varepsilon, \qquad \hat\varepsilon = \varepsilon - \Pi\varepsilon = (I_n - \Pi)\varepsilon.$$

Next, $\Pi$ and $I_n - \Pi$ are orthogonal projectors implying orthogonal and thus
uncorrelated vectors $\Pi\varepsilon$ and $(I_n - \Pi)\varepsilon$. Under normality of $\varepsilon$, these vectors
are also normal, and uncorrelatedness implies independence. The property (7.9) for
the distribution of $\Pi\varepsilon$ was proved in Theorem 4.6.1. For the distribution of
$\hat\varepsilon = (I_n - \Pi)\varepsilon$, we can apply Theorem 7.1.3.


Next we discuss the power of the LR test $\phi$ defined as the probability of
detecting the alternative when the response $f^*$ deviates from the null $f_0$. In
the next result we do not assume that the true response $f^*$ follows the linear PA
$f^* = \Psi^\top\theta$ and show that the test power depends on the value $\|\Pi(f^* - f_0)\|^2$.

Theorem 7.1.5. Consider the model (7.1) with $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$ for a known value
$\sigma^2$. Define

$$\Delta = \sigma^{-2}\|\Pi(f^* - f_0)\|^2.$$

Then the power of the LR test only depends on $\Delta$, i.e. it is the same for all $f^*$
with equal $\Delta$-value. It holds

$$P(\phi = 1) = P\bigl(|\xi_1 + \sqrt{\Delta}|^2 + \xi_2^2 + \ldots + \xi_p^2 > 2\mathfrak{z}_\alpha\bigr)$$

with $\xi = (\xi_1, \ldots, \xi_p)^\top \sim N(0, I_p)$.

Proof. It follows from $\tilde f = \Pi Y = \Pi(f^* + \varepsilon)$ and $f_0 = \Pi f_0$ for the test
statistic $T = (2\sigma^2)^{-1}\|\tilde f - f_0\|^2$ that

$$T = (2\sigma^2)^{-1}\|\Pi(f^* - f_0) + \Pi\varepsilon\|^2.$$

Now we show that the distribution of $T$ depends on the response $f^*$ only via the
value $\Delta$. For this we compute the Laplace transform of $T$.


Lemma 7.1.1. It holds for  < 1 


_bA | 


Pp 
Maw 7 7 Wel). 


g() = log Eexp{uT} = 
Proof. For a standard Gaussian random variable € and any a, it holds 
Eexp{ulé + al*/2} 
= e297) “1/2 / exp{ pax 4 ig? 2 = x*/2\dx 
= exp Mas, ee |_ fex)-S a (: se ) hac 
2 "3G —B)) J20 1—p 


Mawr me 


= exp} i 
2(1 — p) 


The projector IT can be represented as TI = U'A pU for an orthogonal transform 
U and the diagonal matrix A, = diag(1,...,1,0,..., 0) with only p unit 
eigenvalues. This permits representing 7 in the form 


Pp 
T=) +4;)/2 
j=l 


with iid. standard normal r.v. €; and numbers a; satisfying )° j a; = A. The 
independence of the €; ’s implies 
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yea, wa} _ AP 
g(L) >| 5) + gE — Stoe(t = 1) | = AE — E togtt - 1) 


as required. 


The result of Lemma 7.1.1 shows that the Laplace transform of $T$ depends on $f^*$
only via $\Delta$, and so this also holds for the distribution of $T$.
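The statement of Lemma 7.1.1 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the values $p = 3$, $\Delta = 4$, $\mu = 0.2$, and the two shift vectors are illustrative assumptions. It compares the closed form of $g(\mu)$ with Monte Carlo estimates for two different shift vectors that share the same value of $\Delta$.

```python
import math
import random


def log_laplace_closed_form(mu, delta, p):
    """Closed form g(mu) = mu*delta/(2(1-mu)) - (p/2)*log(1-mu), for mu < 1."""
    if mu >= 1:
        raise ValueError("the Laplace transform is finite only for mu < 1")
    return mu * delta / (2.0 * (1.0 - mu)) - 0.5 * p * math.log(1.0 - mu)


def log_laplace_monte_carlo(mu, shifts, n_draws=100_000, seed=0):
    """Estimate log E exp(mu*T) for T = sum_j (xi_j + a_j)^2 / 2 by simulation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        t = sum((rng.gauss(0.0, 1.0) + a) ** 2 for a in shifts) / 2.0
        total += math.exp(mu * t)
    return math.log(total / n_draws)


# Illustrative (hypothetical) values: p = 3, Delta = 4, mu = 0.2.
p, delta, mu = 3, 4.0, 0.2
shifts_one = [2.0, 0.0, 0.0]                  # all of Delta in one coordinate
shifts_spread = [math.sqrt(delta / p)] * p    # Delta spread evenly

exact = log_laplace_closed_form(mu, delta, p)
mc_one = log_laplace_monte_carlo(mu, shifts_one, seed=1)
mc_spread = log_laplace_monte_carlo(mu, shifts_spread, seed=2)
print(f"closed form {exact:.4f}, simulated {mc_one:.4f} and {mc_spread:.4f}")
```

Both simulated values should agree with the closed form up to Monte Carlo error, whichever way the total shift $\Delta$ is distributed over the coordinates.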


The distribution of the squared norm $\|\xi + h\|^2$ for $\xi \sim N(0, I_p)$ and any
fixed vector $h \in \mathbb{R}^p$ with $\|h\|^2 = \Delta$ is called non-central chi-squared with
the non-centrality parameter $\Delta$. In particular, for each $\alpha, \alpha_1$ one can define the
minimal value $\Delta$ providing the prescribed error $\alpha_1$ of the second kind under the
given level $\alpha$ by the relation

\[ P(\|\xi + h\|^2 \ge 2\mathfrak{z}_\alpha) \ge 1 - \alpha_1 \quad\text{subject to}\quad P(\|\xi\|^2 > 2\mathfrak{z}_\alpha) \le \alpha \tag{7.10} \]

with $\|h\|^2 \ge \Delta$. The results from Sect. 4.6 indicate that the value $\mathfrak{z}_\alpha$ can be
bounded from above by $p + \sqrt{2p\log\alpha^{-1}}$ for moderate values of $\alpha^{-1}$. For
evaluating the value $\Delta$, the following decomposition is useful:

\[ \|\xi + h\|^2 - \|h\|^2 - p = \|\xi\|^2 - p + 2h^\top\xi . \]

The right-hand side of this equality is a sum of centered Gaussian quadratic and
linear forms. In particular, the cross term $2h^\top\xi$ is a centered Gaussian r.v. with the
variance $4\|h\|^2$, while $\operatorname{Var}(\|\xi\|^2) = 2p$. These arguments suggest taking $\Delta$ of
order $p$ to ensure the prescribed power $1 - \alpha_1$.


Theorem 7.1.6. For each $\alpha, \alpha_1 \in (0,1)$, there are absolute constants $C$ and $C_1$
such that (7.10) is fulfilled for $\|h\|^2 \ge \Delta$ with

\[ \Delta^{1/2} = \sqrt{C p\log\alpha^{-1}} + \sqrt{C_1 p\log\alpha_1^{-1}} . \]

The result of Theorem 7.1.6 reveals a problem with the power of the LR
test when the dimensionality of the parameter space grows. Indeed, the test remains
insensitive for all alternatives in the zone $\sigma^{-2}\|\Pi(f^* - f_0)\|^2 \le Cp$, and this zone
becomes larger and larger with $p$.


7.1.3 I.i.d. Errors with Unknown Variance


This section briefly discusses the case when the errors $\varepsilon_i$ are i.i.d. but the variance
$\sigma^2 = \operatorname{Var}(\varepsilon_i)$ is unknown. A natural idea in this case is to estimate the variance
from the data. The decomposition (7.8) and independence of $\mathrm{RSS} = \|Y - \tilde f\|^2$
and $\|\tilde f - f_0\|^2$ are particularly helpful. Theorem 7.1.4 suggests estimating $\sigma^2$
from RSS by

\[ \tilde\sigma^2 = \frac{1}{n-p}\,\mathrm{RSS} = \frac{1}{n-p}\,\|Y - \tilde f\|^2 . \]

Indeed, due to the result (7.9), $\sigma^{-2}\,\mathrm{RSS} \sim \chi^2_{n-p}$, yielding

\[ E\tilde\sigma^2 = \sigma^2, \qquad \operatorname{Var}\tilde\sigma^2 = \frac{2\sigma^4}{n-p}, \tag{7.11} \]

and therefore, $\tilde\sigma^2$ is an unbiased, root-n consistent estimate of $\sigma^2$.


Exercise 7.1.2. Check (7.11). Show that $\tilde\sigma^2 - \sigma^2 \stackrel{P}{\longrightarrow} 0$.
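The two moments in (7.11) can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not the construction of the text: it uses the simplest linear model with a constant response ($p = 1$, so that $\tilde f$ is the sample mean), and the sample size, noise level and number of replications are arbitrary choices.

```python
import random
import statistics


def variance_estimate(y):
    """Return RSS / (n - p) for the constant-mean model, where p = 1."""
    n = len(y)
    fitted = sum(y) / n                        # projection of Y on constants
    rss = sum((v - fitted) ** 2 for v in y)    # residual sum of squares
    return rss / (n - 1)


rng = random.Random(0)
n, sigma, n_rep = 10, 2.0, 20_000              # hypothetical settings
estimates = [
    variance_estimate([rng.gauss(1.0, sigma) for _ in range(n)])
    for _ in range(n_rep)
]

mc_mean = statistics.fmean(estimates)
mc_var = statistics.pvariance(estimates)
theory_var = 2.0 * sigma ** 4 / (n - 1)        # formula (7.11) with p = 1
print(f"mean {mc_mean:.3f} vs {sigma ** 2:.3f}")
print(f"variance {mc_var:.3f} vs {theory_var:.3f}")
```

The simulated mean and variance of the estimate should be close to $\sigma^2$ and $2\sigma^4/(n-p)$, respectively.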


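Now we consider the LR test (7.5) in which the true variance is replaced by its
estimate $\tilde\sigma^2$:

\[ \tilde T \stackrel{\mathrm{def}}{=} \frac{\|\tilde f - f_0\|^2}{2\tilde\sigma^2} = \frac{(n-p)\,\|\tilde f - f_0\|^2}{2\,\|Y - \tilde f\|^2} = \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{2\,\mathrm{RSS}/(n-p)} . \]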


The result of Theorem 7.1.4 implies the pivotal property for this test statistic as well. 


Theorem 7.1.7. Consider the model (7.1) with $\varepsilon \sim N(0, \sigma^2 I_n)$ for an unknown
value $\sigma^2$. Then the distribution of the test statistic $\tilde T$ under $P_{\theta_0}$ only depends on
$p$ and $n-p$:

\[ p^{-1}\,2\tilde T = \frac{n-p}{p}\,\frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}} \sim F_{p,n-p}, \]

where $F_{p,n-p}$ denotes the Fisher distribution with parameters $p, n-p$:

\[ F_{p,n-p} = \mathcal{L}\Bigl(\frac{\|\xi_p\|^2/p}{\|\xi_{n-p}\|^2/(n-p)}\Bigr), \]

where $\xi_p$ and $\xi_{n-p}$ are two independent standard Gaussian vectors of dimension
$p$ and $n-p$; see Theorem 6.4.1. In particular, it does not depend on the design
matrix $\Psi$, the noise variance $\sigma^2$, and the true parameter $\theta_0$.


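This result suggests fixing the critical value $\tilde{\mathfrak{z}}_\alpha$ for the test statistic $\tilde T$ using the
quantiles of the Fisher distribution: if $t_\alpha$ is such that $F_{p,n-p}(t_\alpha) = 1-\alpha$, then
$\tilde{\mathfrak{z}}_\alpha = p\,t_\alpha/2$.

Theorem 7.1.8. Consider the model (7.1) with $\varepsilon \sim N(0,\sigma^2 I_n)$ for an unknown
value $\sigma^2$. If $F_{p,n-p}(t_\alpha) = 1-\alpha$ and $\tilde{\mathfrak{z}}_\alpha = p\,t_\alpha/2$, then the test $\tilde\phi = 1(\tilde T > \tilde{\mathfrak{z}}_\alpha)$
is a level-$\alpha$ test:

\[ P_{\theta_0}(\tilde\phi = 1) = P_{\theta_0}(\tilde T > \tilde{\mathfrak{z}}_\alpha) = \alpha . \]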


Exercise 7.1.3. Prove the result of Theorem 7.1.8. 


If the sample size $n$ is sufficiently large, then $\tilde\sigma^2$ is very close to $\sigma^2$ and one can
apply an approximate choice of the critical value $\mathfrak{z}_\alpha$ from the case of $\sigma^2$ known:

\[ \check\phi = 1(\tilde T > \mathfrak{z}_\alpha). \]

This test is not of exact level $\alpha$ but it is of asymptotic level $\alpha$. Its power
function is also close to the power function of the test $\phi$ corresponding to the
known variance $\sigma^2$.


Theorem 7.1.9. Consider the model (7.1) with $\varepsilon \sim N(0,\sigma^2 I_n)$ for an unknown
value $\sigma^2$. Then

\[ \lim_{n\to\infty} P_{\theta_0}(\check\phi = 1) = \alpha . \tag{7.12} \]

Moreover,

\[ \lim_{n\to\infty}\,\sup_{f^*}\,\bigl|P_{f^*}(\check\phi = 1) - P_{f^*}(\phi = 1)\bigr| = 0 . \tag{7.13} \]


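Exercise 7.1.4. Consider the model (7.1) with $\varepsilon \sim N(0,\sigma^2 I_n)$ for $\sigma^2$ unknown
and prove (7.12) and (7.13).
Hints:

* The consistency of $\tilde\sigma^2$ permits restricting to the case $|\tilde\sigma^2/\sigma^2 - 1| \le \delta_n$ for
  $\delta_n \to 0$.

* The independence of $\|\tilde f - f_0\|^2$ and $\tilde\sigma^2$ permits considering the distribution of
  $2\tilde T = \|\tilde f - f_0\|^2/\tilde\sigma^2$ as if $\tilde\sigma^2$ were a fixed number close to $\sigma^2$.

* Use that for $\zeta_p \sim \chi^2_p$,

\[ P\bigl(\zeta_p \ge \mathfrak{z}_\alpha(1 + \delta_n)\bigr) - P(\zeta_p \ge \mathfrak{z}_\alpha) \to 0, \qquad n \to \infty . \]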


7.2 Likelihood Ratio Test for a Subvector 


The previous section dealt with the case of a simple null hypothesis. This section
considers a more general situation when the null hypothesis concerns a subvector
of $\theta$. This means that the whole model is given by (7.2) but the vector $\theta$ is
decomposed into two parts: $\theta = (\gamma, \eta)$, where $\gamma$ is of dimension $p_0 < p$. The
null hypothesis assumes that $\eta = \eta_0$ for all $\gamma$. Usually $\eta_0 = 0$ but the particular
value of $\eta_0$ is not important. To simplify the presentation we assume $\eta_0 = 0$,
leading to the subset $\Theta_0$ of $\Theta$, given by

\[ \Theta_0 = \{\theta = (\gamma, 0)\} . \]

Under the null hypothesis, the model is still linear:

\[ Y = \Psi_\gamma^\top\gamma + \varepsilon, \]

where $\Psi_\gamma$ denotes a submatrix of $\Psi$ composed of the rows of $\Psi$ corresponding
to the $\gamma$-components of $\theta$.

Fix any point $\theta_0 \in \Theta_0$, e.g., $\theta_0 = 0$, and define the corresponding response
$f_0 = \Psi^\top\theta_0$. The LR test statistic $T$ can be written in the form

\[ T = \max_{\theta\in\Theta} L(\theta, \theta_0) - \max_{\theta\in\Theta_0} L(\theta, \theta_0). \tag{7.14} \]

The results of both maximization problems are known:

\[ \max_{\theta\in\Theta} L(\theta,\theta_0) = \frac{1}{2\sigma^2}\,\|\tilde f - f_0\|^2, \qquad \max_{\theta\in\Theta_0} L(\theta,\theta_0) = \frac{1}{2\sigma^2}\,\|\tilde f_0 - f_0\|^2, \]

where $\tilde f_0$ and $\tilde f$ are estimates of the response under the null and the alternative,
respectively. As in Theorem 7.1.2 we can establish the following geometric
decomposition.


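Theorem 7.2.1. Consider the model (7.1). Let $T$ be the LR test statistic from (7.14)
built under the assumptions $f^* = \Psi^\top\theta^*$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$ with a known value
$\sigma^2$. Then $T$ is given by

\[ T = \frac{1}{2\sigma^2}\,\|\tilde f - \tilde f_0\|^2 . \]

Moreover, the following decompositions for the vector of observations $Y$ and for
the residuals $\hat\varepsilon_0 = Y - \tilde f_0$ from the null hold:

\[ Y - \tilde f_0 = \tilde f - \tilde f_0 + \hat\varepsilon, \qquad \|Y - \tilde f_0\|^2 = \|\tilde f - \tilde f_0\|^2 + \|\hat\varepsilon\|^2, \tag{7.15} \]

where $\tilde f - \tilde f_0$ is the difference between the estimated response under the null and
under the alternative, and $\hat\varepsilon = Y - \tilde f$ is the vector of residuals from the alternative.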


Proof. The proof is similar to the proof of Theorem 7.1.2. We use that $\tilde f = \Pi Y$,
where $\Pi = \Psi^\top(\Psi\Psi^\top)^{-1}\Psi$ is the projector on the space $\mathcal{L}$ spanned by the rows
of $\Psi$. Similarly $\tilde f_0 = \Pi_0 Y$, where $\Pi_0 = \Psi_\gamma^\top(\Psi_\gamma\Psi_\gamma^\top)^{-1}\Psi_\gamma$ is the projector
on the subspace $\mathcal{L}_0$ spanned by the rows of $\Psi_\gamma$. This yields the decomposition
$\tilde f - f_0 = \Pi(Y - f_0)$. Similarly,

\[ \tilde f - \tilde f_0 = (\Pi - \Pi_0)Y, \qquad \hat\varepsilon = Y - \tilde f = (I_n - \Pi)Y . \]

As $\Pi - \Pi_0$ and $I_n - \Pi$ are orthogonal projectors, it follows

\[ \|Y - \tilde f_0\|^2 = \|(I_n - \Pi)Y + (\Pi - \Pi_0)Y\|^2 = \|(I_n - \Pi)Y\|^2 + \|(\Pi - \Pi_0)Y\|^2 \]

and the decomposition (7.15) follows.


The decomposition (7.15) can again be represented as $\mathrm{RSS}_0 = \mathrm{RSS} + 2\sigma^2 T$,
where $\mathrm{RSS} = \|Y - \tilde f\|^2$ is the sum of squared residuals, while $\mathrm{RSS}_0 = \|Y - \tilde f_0\|^2$
is the same quantity computed under the null.

Now we show how a pivotal test of exact level $\alpha$ based on $T$ can be constructed
if the model assumptions are correct.


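Theorem 7.2.2. Consider the model (7.1) with $\varepsilon \sim N(0, \sigma^2 I_n)$ for a known value
$\sigma^2$, i.e. the $\varepsilon_i$ are i.i.d. normal. Then $\tilde f - \tilde f_0$ and $\hat\varepsilon$ are under $P_{\theta_0}$ zero-mean
independent Gaussian vectors satisfying

\[ 2T = \sigma^{-2}\,\|\tilde f - \tilde f_0\|^2 \sim \chi^2_{p-p_0}, \qquad \sigma^{-2}\,\|\hat\varepsilon\|^2 \sim \chi^2_{n-p}. \tag{7.16} \]

Let $\mathfrak{z}_\alpha$ fulfill $P(\zeta_{p-p_0} \ge \mathfrak{z}_\alpha) = \alpha$ for $\zeta_{p-p_0} \sim \chi^2_{p-p_0}$. Then the test
$\phi = 1(T > \mathfrak{z}_\alpha/2)$ is an LR test of exact level $\alpha$.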


Proof. The null assumption $\theta^* \in \Theta_0$ implies $f^* \in \mathcal{L}_0$. This, together with
$\Pi f^* = \Pi_0 f^* = f^*$, implies the following decomposition:

\[ \tilde f - \tilde f_0 = (\Pi - \Pi_0)\varepsilon, \qquad \hat\varepsilon = \varepsilon - \Pi\varepsilon = (I_n - \Pi)\varepsilon . \]

Next, $\Pi - \Pi_0$ and $I_n - \Pi$ are orthogonal projectors, implying orthogonal and
thus uncorrelated vectors $(\Pi - \Pi_0)\varepsilon$ and $(I_n - \Pi)\varepsilon$. Under normality of $\varepsilon$,
these vectors are also normal, and uncorrelatedness implies independence. The property
(7.16) is similar to (7.9) and can easily be verified by making use of Theorem 7.1.3.


If the variance $\sigma^2$ of the noise is unknown, one can proceed exactly as in the case
of a simple null: estimate the variance from the residuals using their independence
of the test statistic $T$. This leads to the estimate

\[ \tilde\sigma^2 = \frac{1}{n-p}\,\mathrm{RSS} = \frac{1}{n-p}\,\|Y - \tilde f\|^2 \]

and to the test statistic

\[ \tilde T = \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{2\,\mathrm{RSS}/(n-p)} = \frac{(n-p)\,\|\tilde f - \tilde f_0\|^2}{2\,\|Y - \tilde f\|^2} . \]

The property of pivotality is preserved here as well: properly scaled, the test statistic
$\tilde T$ has a Fisher distribution.

Theorem 7.2.3. Consider the model (7.1) with $\varepsilon \sim N(0,\sigma^2 I_n)$ for an unknown
value $\sigma^2$. Then $2\tilde T/(p - p_0)$ has the Fisher distribution $F_{p-p_0,n-p}$ with
parameters $p - p_0$ and $n - p$. If $t_\alpha$ is the $1-\alpha$ quantile of this distribution, then
the test $\tilde\phi = 1\bigl(\tilde T > (p - p_0)\,t_\alpha/2\bigr)$ is of exact level $\alpha$.


If the sample size is sufficiently large, one can proceed as if $\tilde\sigma^2$ were the true
variance, ignoring the error of variance estimation. This would lead to the critical
value $\mathfrak{z}_\alpha$ from Theorem 7.2.2, and the corresponding test is of asymptotic level $\alpha$.


Exercise 7.2.1. Prove Theorem 7.2.3. 


The study of the power of the test based on $T$ does not differ from the case of a simple
hypothesis. One only needs to redefine $\Delta$ as

\[ \Delta \stackrel{\mathrm{def}}{=} \sigma^{-2}\,\|(\Pi - \Pi_0) f^*\|^2 . \]


7.3 Likelihood Ratio Test for a Linear Hypothesis 


In this section, we generalize the test problem for the Gaussian linear model further
and assume that we want to test the linear hypothesis $H_0\colon C\theta = d$ for a given
contrast matrix $C \in \mathbb{R}^{r\times p}$ with $\operatorname{rank}(C) = r \le p$ and a right-hand side vector
$d \in \mathbb{R}^r$. Notice that the point null hypothesis $H_0\colon \theta = \theta_0$ and the hypothesis
$H_0\colon \eta = \eta_0$ regarded in Sect. 7.2 are linear hypotheses. Here, we restrict our
attention to the case that the error variance $\sigma^2$ is unknown, as is typically the
case in practice.


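Theorem 7.3.1. Assume model (7.1) with $\varepsilon \sim N(0,\sigma^2 I_n)$ for an unknown value
$\sigma^2$ and consider the linear hypothesis $H_0\colon C\theta = d$. Then it holds:

(a) The likelihood ratio statistic $T$ is an isotone transformation of
    $F \stackrel{\mathrm{def}}{=} \frac{n-p}{r}\,\frac{\Delta\mathrm{RSS}}{\mathrm{RSS}}$, where $\Delta\mathrm{RSS} = \mathrm{RSS}_0 - \mathrm{RSS}$.

(b) The restricted MLE $\tilde\theta_0$ is given by

\[ \tilde\theta_0 = \tilde\theta - (\Psi\Psi^\top)^{-1}C^\top\bigl\{C(\Psi\Psi^\top)^{-1}C^\top\bigr\}^{-1}(C\tilde\theta - d). \]

(c) $\Delta\mathrm{RSS}$ and $\mathrm{RSS}$ are independent.

(d) Under $H_0$, $\Delta\mathrm{RSS}/\sigma^2 \sim \chi^2_r$ and $F \sim F_{r,n-p}$.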

Proof. For proving (a), verify that $2T = n\log(\hat\sigma_0^2/\hat\sigma^2)$, where $\hat\sigma^2$ and $\hat\sigma_0^2$ are
the unrestricted and restricted (i.e., under $H_0$) MLEs of the error variance $\sigma^2$,
respectively. Plugging in the explicit representations for $\hat\sigma^2$ and $\hat\sigma_0^2$ yields that
$2T = n\log\bigl(\frac{\Delta\mathrm{RSS}}{\mathrm{RSS}} + 1\bigr)$, implying the assertion. Part (b) is an application of the
following well-known result from quadratic optimization theory.

Lemma 7.3.1. Let $A \in \mathbb{R}^{r\times p}$ and $b \in \mathbb{R}^r$ be fixed and define $M = \{z \in \mathbb{R}^p\colon Az = b\}$.
Moreover, let $f\colon \mathbb{R}^p \to \mathbb{R}$ be given by $f(z) = z^\top Q z/2 - c^\top z$ for a symmetric,
positive semi-definite $(p\times p)$-matrix $Q$ and a vector $c \in \mathbb{R}^p$. Then, the unique
minimum of $f$ over the search space $M$ is characterized by solving the system of
linear equations

\[ Qz - A^\top y = c, \qquad Az = b \]

for $(y, z)$. The component $z$ of this solution minimizes $f$ over $M$.


For part (c), we utilize the explicit form of $\tilde\theta_0$ from part (b) and write $\Delta\mathrm{RSS}$ in
the form

\[ \Delta\mathrm{RSS} = (C\tilde\theta - d)^\top\bigl\{C(\Psi\Psi^\top)^{-1}C^\top\bigr\}^{-1}(C\tilde\theta - d). \]

This shows that $\Delta\mathrm{RSS}$ is a deterministic transformation of $\tilde\theta$. Since $\tilde\theta$ and RSS
are independent, the assertion follows. This representation of $\Delta\mathrm{RSS}$ moreover
shows that $\Delta\mathrm{RSS}/\sigma^2$ is a quadratic form of a (under $H_0$) standard normal random
vector, and part (d) follows from Theorem 7.1.3.


To sum up, Theorem 7.3.1 shows that general linear hypotheses can be tested
with $F$-tests under model (7.1) with $\varepsilon \sim N(0,\sigma^2 I_n)$ for an unknown value $\sigma^2$.


7.4 Wald Test 


The drawback of the LR test method for testing a linear hypothesis $H_0\colon C\theta = d$
without assuming Gaussian noise is that a constrained maximization of the
log-likelihood function under the constraints encoded by $C$ and $d$ has to be
performed. This computationally intensive step can be avoided by using Wald
statistics. The Wald statistic for testing $H_0$ is given by

\[ W = (C\tilde\theta - d)^\top (C\hat V C^\top)^{-1}(C\tilde\theta - d), \]

where $\tilde\theta$ and $\hat V$ denote the MLE in the full (unrestricted) model and the estimated
covariance matrix of $\tilde\theta$, respectively.


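Theorem 7.4.1. Under model (7.2), the statistic $W$ is asymptotically equivalent to
the LR test statistic $2T$. In particular, under $H_0$, the distribution of $W$ converges
to $\chi^2_r$: $W \stackrel{w}{\longrightarrow} \chi^2_r$ as $n \to \infty$. In the case of normally distributed noise, it holds
$W = rF$, where $F$ is as in Theorem 7.3.1.

Proof. For proving the asymptotic $\chi^2_r$-distribution of $W$, we use the asymptotic
normality of the MLE $\tilde\theta$. If the model is regular, it holds

\[ \sqrt{n}\,(\tilde\theta_n - \theta_0) \stackrel{w}{\longrightarrow} N\bigl(0, F(\theta_0)^{-1}\bigr) \quad\text{under } \theta_0 \text{ for } n \to \infty, \]

and, consequently,

\[ F(\theta_0)^{1/2}\,n^{1/2}\,(\tilde\theta_n - \theta_0) \stackrel{w}{\longrightarrow} N(0, I_r), \quad\text{where } r \stackrel{\mathrm{def}}{=} \dim(\theta_0). \]

Applying the Continuous Mapping Theorem, we get

\[ (\tilde\theta_n - \theta_0)^\top\, n F(\theta_0)\,(\tilde\theta_n - \theta_0) \stackrel{w}{\longrightarrow} \chi^2_r . \]

If the Fisher information is continuous and $F(\tilde\theta_n)$ is a consistent estimator for
$F(\theta_0)$, it still holds that

\[ (\tilde\theta_n - \theta_0)^\top\, n F(\tilde\theta_n)\,(\tilde\theta_n - \theta_0) \stackrel{w}{\longrightarrow} \chi^2_r . \]

Substituting $\tilde\theta_n = C\tilde\theta - d$ and $\theta_0 = 0 \in \mathbb{R}^r$, we obtain the assertion concerning
the asymptotic distribution of $W$ under $H_0$. For proving the relationship between
$W$ and $F$ in the case of Gaussian noise, we notice that, with $\hat V = \tilde\sigma^2(\Psi\Psi^\top)^{-1}$,

\[
\begin{aligned}
F &= \frac{n-p}{r}\,\frac{\Delta\mathrm{RSS}}{\mathrm{RSS}} \\
  &= \frac{n-p}{r}\,\frac{(C\tilde\theta - d)^\top\bigl\{C(\Psi\Psi^\top)^{-1}C^\top\bigr\}^{-1}(C\tilde\theta - d)}{(n-p)\,\tilde\sigma^2} \\
  &= \frac{(C\tilde\theta - d)^\top(C\hat V C^\top)^{-1}(C\tilde\theta - d)}{r} = \frac{W}{r}
\end{aligned}
\]

as required.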


7.5 Analysis of Variance 


Important special cases of linear models arise when all entries of the design matrix
$\Psi$ are binary, i.e., $\Psi_{ij} \in \{0,1\}$ for all $1 \le i \le p$, $1 \le j \le n$. Every row of
$\Psi$ then has the interpretation of a group indicator, where $\Psi_{ij} = 1$ if and only if
observational unit $j$ belongs to group $i$. The target of such an analysis of variance
is to determine if the mean response differs between groups. In this section, we
specialize the theory of testing in linear models to this situation.


7.5.1 Two-Sample Test


First, we consider the case of $p = 2$, meaning that exactly two samples
corresponding to the two groups $A$ and $B$ are given. We let $n_A$ denote the number
of observations in group $A$ and $n_B = n - n_A$ the number of observations in group
$B$. The parameter of interest $\theta \in \mathbb{R}^2$ in this model consists of the two population
means $\eta_A$ and $\eta_B$, and the model reads

\[ Y_{ij} = \eta_i + \varepsilon_{ij}, \qquad i \in \{A, B\}, \quad j = 1,\ldots,n_i . \]

Noticing that $(\Psi\Psi^\top)^{-1} = \begin{pmatrix} 1/n_A & 0 \\ 0 & 1/n_B \end{pmatrix}$, we get that $\tilde\theta = (\bar Y_A, \bar Y_B)^\top$, where
$\bar Y_A$ and $\bar Y_B$ denote the two sample means. With this, we immediately obtain that
the estimator for the noise variance is in this model given by

\[ \tilde\sigma^2 = \frac{\mathrm{RSS}}{n-p} = \frac{1}{n_A + n_B - 2}\Bigl(\sum_{j=1}^{n_A}(Y_{A,j} - \bar Y_A)^2 + \sum_{j=1}^{n_B}(Y_{B,j} - \bar Y_B)^2\Bigr). \]

The quantity $s^2 \stackrel{\mathrm{def}}{=} \tilde\sigma^2$ is called the pooled sample variance. Denoting the
group-specific sample variances by

\[ s_A^2 = \frac{1}{n_A}\sum_{j=1}^{n_A}(Y_{A,j} - \bar Y_A)^2, \qquad s_B^2 = \frac{1}{n_B}\sum_{j=1}^{n_B}(Y_{B,j} - \bar Y_B)^2, \]

we have that

\[ s^2 = (n_A s_A^2 + n_B s_B^2)/(n_A + n_B - 2) \]

is a weighted average of $s_A^2$ and $s_B^2$.

Under this model, we are interested in testing the null hypothesis $H_0\colon \eta_A = \eta_B$
of equal group means against the two-sided alternative $H_1\colon \eta_A \ne \eta_B$ or the
one-sided alternative $H_1^+\colon \eta_A > \eta_B$. The one-sided alternative hypothesis $H_1^-\colon \eta_A < \eta_B$
can be treated by switching the group labels.

The null hypothesis $H_0\colon \eta_A - \eta_B = 0$ is a linear hypothesis in the sense of
Sect. 7.3, where the number of restrictions is $r = 1$. Under $H_0$, the grand average

\[ \bar Y \stackrel{\mathrm{def}}{=} n^{-1}\sum_{i\in\{A,B\}}\sum_{j=1}^{n_i} Y_{ij} \]

is the MLE for the (common) population mean. Straightforwardly, we calculate that
the sum of squares of residuals under $H_0$ is given by

\[ \mathrm{RSS}_0 = \sum_{i\in\{A,B\}}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y)^2 . \]

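For computing $\Delta\mathrm{RSS} = \mathrm{RSS}_0 - \mathrm{RSS}$ and the statistic $F$ from Theorem 7.3.1 (a),
we obtain the following results.


Lemma 7.5.1.

\[ \mathrm{RSS}_0 = \mathrm{RSS} + \bigl(n_A(\bar Y_A - \bar Y)^2 + n_B(\bar Y_B - \bar Y)^2\bigr), \tag{7.17} \]

\[ F = \frac{n_A n_B}{n_A + n_B}\,\frac{(\bar Y_A - \bar Y_B)^2}{s^2} . \tag{7.18} \]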


Exercise 7.5.1. Prove Lemma 7.5.1. 
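Before proving Lemma 7.5.1, it can be helpful to confirm both identities on a small data set. The sketch below is only an illustration: the function name and the two samples are made up, and the statistic $F$ is computed once from its definition in Theorem 7.3.1 (a) with $r = 1$ and once from the right-hand side of (7.18).

```python
def two_sample_sums_of_squares(sample_a, sample_b):
    """Compute RSS, RSS_0 and the F statistic in two equivalent ways."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    grand = (sum(sample_a) + sum(sample_b)) / (n_a + n_b)

    rss = (sum((y - mean_a) ** 2 for y in sample_a)
           + sum((y - mean_b) ** 2 for y in sample_b))
    rss0 = sum((y - grand) ** 2 for y in sample_a + sample_b)
    between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2

    s2 = rss / (n_a + n_b - 2)              # pooled sample variance
    f_from_rss = (rss0 - rss) / s2          # (n - p)/r * dRSS/RSS with r = 1
    f_from_means = n_a * n_b / (n_a + n_b) * (mean_a - mean_b) ** 2 / s2
    return {"rss": rss, "rss0": rss0, "between": between,
            "f_from_rss": f_from_rss, "f_from_means": f_from_means}


# Hypothetical measurements for groups A and B.
result = two_sample_sums_of_squares([5.1, 4.9, 6.0, 5.5], [4.2, 4.8, 3.9])
print(result)
```

Up to rounding error, `rss0` equals `rss + between`, which is (7.17), and the two values of $F$ coincide, which is (7.18).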


We conclude from Theorem 7.3.1 that, if the individual errors are homogeneous
between samples, the test statistic

\[ t = \frac{\bar Y_A - \bar Y_B}{s\sqrt{1/n_A + 1/n_B}} \]

is $t$-distributed with $n_A + n_B - 2$ degrees of freedom under the null hypothesis $H_0$.
For a given significance level $\alpha$, define the quantile $q_\alpha$ of Student's
$t$-distribution with $n_A + n_B - 2$ degrees of freedom by the equation

\[ P(t_0 > q_\alpha) = \alpha, \]

where $t_0$ represents the distribution of $t$ in the case that $H_0$ is true. Utilizing the
symmetry property

\[ P(|t_0| > q_{\alpha/2}) = 2P(t_0 > q_{\alpha/2}) = \alpha, \]

the one-sided two-sample $t$-test for $H_0$ versus $H_1^+$ rejects if $t > q_\alpha$ and the
two-sided two-sample $t$-test for $H_0$ versus $H_1$ rejects if $|t| > q_{\alpha/2}$.
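The two decision rules can be written down directly. The following sketch is an illustration under assumptions: the function names and the data are hypothetical, and the quantiles of the $t$-distribution with 5 degrees of freedom ($q_{0.025} \approx 2.571$, $q_{0.05} \approx 2.015$) are taken from standard tables and passed in as inputs rather than computed.

```python
import math


def two_sample_t(sample_a, sample_b):
    """Two-sample t statistic based on the pooled sample variance s^2."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    rss = (sum((y - mean_a) ** 2 for y in sample_a)
           + sum((y - mean_b) ** 2 for y in sample_b))
    s = math.sqrt(rss / (n_a + n_b - 2))
    return (mean_a - mean_b) / (s * math.sqrt(1.0 / n_a + 1.0 / n_b))


def reject_two_sided(t_value, q_half_alpha):
    """Reject H0 against eta_A != eta_B if |t| exceeds q_{alpha/2}."""
    return abs(t_value) > q_half_alpha


def reject_one_sided(t_value, q_alpha):
    """Reject H0 against eta_A > eta_B if t exceeds q_alpha."""
    return t_value > q_alpha


# Hypothetical data with n_A + n_B - 2 = 5 degrees of freedom.
t_value = two_sample_t([5.1, 4.9, 6.0, 5.5], [4.2, 4.8, 3.9])
two_sided = reject_two_sided(t_value, q_half_alpha=2.571)   # alpha = 0.05
one_sided = reject_one_sided(t_value, q_alpha=2.015)        # alpha = 0.05
print(f"t = {t_value:.3f}, two-sided reject: {two_sided}, one-sided reject: {one_sided}")
```

Since $r = 1$ here, the square of $t$ equals the statistic $F$ from (7.18), so the two-sided $t$-test and the $F$-test lead to the same decision.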


7.5.2 Comparing K Treatment Means 


This section deals with the more general one-factorial analysis of variance in
presence of $K \ge 2$ groups. We assume that $K$ samples are given, each of size
$n_k$ for $k = 1,\ldots,K$. The following quantities are relevant for testing the null
hypothesis of equal group (treatment) means.

Sample mean in treatment group $k$:

\[ \bar Y_k = \frac{1}{n_k}\sum_{i=1}^{n_k} Y_{ki} \]

Sum of squares (SS) within treatment group $k$:

\[ S_k = \sum_{i=1}^{n_k} (Y_{ki} - \bar Y_k)^2 \]

Sample variance within treatment group $k$:

\[ s_k^2 = S_k/(n_k - 1) \]

Denoting the pooled sample size by $N = n_1 + \ldots + n_K$, we furthermore define
the following pooled measures.

Pooled (grand) mean:

\[ \bar Y = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{n_k} Y_{ki} \]

Within-treatment sum of squares:

\[ S_R = S_1 + \ldots + S_K \]

Between-treatment sum of squares:

\[ S_T = \sum_{k=1}^{K} n_k(\bar Y_k - \bar Y)^2 \]

Within- and between-treatment mean squares:

\[ s_R^2 = S_R/(N - K), \qquad s_T^2 = S_T/(K - 1) \]

In analogy to Lemma 7.5.1, the following result regarding the decomposition of
spread holds.


Lemma 7.5.2. Let the overall variation in the data be defined by

\[ S_D \stackrel{\mathrm{def}}{=} \sum_{k}\sum_{i} (Y_{ki} - \bar Y)^2 . \]

Then it holds

\[ S_D = \sum_{k}\sum_{i} (Y_{ki} - \bar Y_k)^2 + \sum_{k} n_k(\bar Y_k - \bar Y)^2 = S_R + S_T . \]


Exercise 7.5.2. Prove Lemma 7.5.2. 


The variance components in a one-factorial analysis of variance can be summarized
in an analysis of variance table, cf. Table 7.1.

Table 7.1 Variance components in a one-factorial analysis of variance

| Source of variation | Sum of squares | Degrees of freedom | Mean square |
|---|---|---|---|
| Average | $S_A = N\bar Y^2$ | $\nu_A = 1$ | $s_A^2 = S_A/\nu_A$ |
| Between treatments | $S_T = \sum_k n_k(\bar Y_k - \bar Y)^2$ | $\nu_T = K - 1$ | $s_T^2 = S_T/\nu_T$ |
| Within treatments | $S_R = \sum_k\sum_i (Y_{ki} - \bar Y_k)^2$ | $\nu_R = N - K$ | $s_R^2 = S_R/\nu_R$ |
| Total | $S_D = \sum_k\sum_i (Y_{ki} - \bar Y)^2$ | $N - 1$ | |

For testing the global hypothesis $H_0\colon \eta_1 = \eta_2 = \ldots = \eta_K$ against the
(two-sided) alternative $H_1\colon \exists\,(i,j)$ with $\eta_i \ne \eta_j$, we again apply Theorem 7.3.1, leading
to the following result.


Corollary 7.5.1. Under the model of the analysis of variance with $K$ groups,
assume homogeneous Gaussian noise with noise variance $\sigma^2 > 0$. Then it holds:

(i) $S_R/\sigma^2 \sim \chi^2_{N-K}$.

(ii) Under $H_0$, $S_T/\sigma^2 \sim \chi^2_{K-1}$.

(iii) $S_R$ and $S_T$ are stochastically independent.

(iv) Under $H_0$, the statistic

\[ F \stackrel{\mathrm{def}}{=} \frac{S_T/(K-1)}{S_R/(N-K)} \]

is distributed as $F_{K-1,N-K}$.

Therefore, $H_0$ is rejected by a level-$\alpha$ $F$-test if the value of $F$ exceeds the quantile
$F_{K-1,N-K;1-\alpha}$ of Fisher's $F$-distribution with $K - 1$ and $N - K$ degrees of
freedom.
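The quantities entering Corollary 7.5.1 are easy to compute from grouped data. The sketch below is an illustration with made-up numbers and a hypothetical function name; it evaluates $S_D$, $S_R$, $S_T$ and the statistic $F$, so that the decomposition of Lemma 7.5.2 can be checked at the same time. The comparison with the quantile $F_{K-1,N-K;1-\alpha}$ is left to a table or a statistics library.

```python
def one_way_anova(groups):
    """Return (S_D, S_R, S_T, F) for a list of samples, one per treatment."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Within-treatment, between-treatment and total sums of squares.
    s_r = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    s_t = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    s_d = sum((y - grand) ** 2 for g in groups for y in g)

    f_stat = (s_t / (k - 1)) / (s_r / (n_total - k))
    return s_d, s_r, s_t, f_stat


# Hypothetical data: K = 3 treatments with three observations each.
s_d, s_r, s_t, f_stat = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"S_D = {s_d:.3f}, S_R = {s_r:.3f}, S_T = {s_t:.3f}, F = {f_stat:.3f}")
```

For these numbers the group means are 2, 3 and 6, which gives $S_R = 6$, $S_T = 26$, $S_D = 32$ and $F = 13$ with $(2, 6)$ degrees of freedom.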


7.5.2.1 Treatment Effects 


For estimating the amount of shift in the mean response caused by treatment $k$ (the
$k$-th treatment effect), it is convenient to re-parametrize the model as follows. We
start with the basic model, given by

\[ Y_{ki} = \eta_k + \varepsilon_{ki}, \qquad \varepsilon_{ki} \sim N(0, \sigma^2) \text{ i.i.d.}, \]

where $\eta_k$ is the true treatment mean.
Now, we introduce the averaged treatment mean $\eta$ and the $k$-th treatment effect
$\tau_k$, given by

\[ \eta \stackrel{\mathrm{def}}{=} \frac{1}{N}\sum_{k} n_k\eta_k, \qquad \tau_k \stackrel{\mathrm{def}}{=} \eta_k - \eta . \]

The re-parametrized model is then given by

\[ Y_{ki} = \eta + \tau_k + \varepsilon_{ki}. \tag{7.19} \]

It is important to notice that this new model representation involves $K + 1$
unknowns. In order to achieve maximum rank of the design matrix, one therefore
has to take the constraint $\sum_{k=1}^{K} n_k\tau_k = 0$ into account when building $\Psi$, for
instance by coding

\[ \tau_K = -n_K^{-1}\sum_{k=1}^{K-1} n_k\tau_k . \]

For the data analysis, it is helpful to consider the decomposition of the observed
data points:

\[ Y_{ki} = \bar Y + (\bar Y_k - \bar Y) + (Y_{ki} - \bar Y_k). \]

Routine algebra then leads to the following results concerning inference in the
treatment effect representation of the analysis of variance model.

Theorem 7.5.1. Under Model (7.19) with $\sum_{k} n_k\tau_k = 0$, it holds:

(i) The MLEs for the unknown model parameters are given by

\[ \tilde\eta = \bar Y, \qquad \tilde\tau_k = \bar Y_k - \bar Y, \quad 1 \le k \le K . \]

(ii) The $F$-statistic for testing the global hypothesis $H_0\colon \tau_1 = \tau_2 = \ldots = \tau_{K-1} = 0$
is identical to the one given in Corollary 7.5.1(iv), i.e., $F = \frac{S_T/(K-1)}{S_R/(N-K)}$.


7.5.3 Randomized Blocks 


This section deals with a special case of the two-factorial analysis of variance. We
assume that the observational units are grouped into $n$ blocks, where in each block
the $K$ treatments under investigation are applied exactly once. The total sample
size is then given by $N = nK$. Since there may be a "block effect" on the mean
response, we consider the model

\[ Y_{ki} = \eta + \beta_i + \tau_k + \varepsilon_{ki}, \qquad 1 \le k \le K, \quad 1 \le i \le n, \tag{7.20} \]

where $\eta$ is the general mean, $\beta_i$ the block effect, and $\tau_k$ the treatment effect.
The data can be decomposed as

\[ Y_{ki} = \bar Y + (\bar Y_i - \bar Y) + (\bar Y_k - \bar Y) + (Y_{ki} - \bar Y_i - \bar Y_k + \bar Y), \]

where $\bar Y$ is the grand average, $\bar Y_i$ the block average, $\bar Y_k$ the treatment average,
and $Y_{ki} - \bar Y_i - \bar Y_k + \bar Y$ the residual.

Table 7.2 Analysis of variance table for a randomized block design

| Source of variation | Sum of squares | Degrees of freedom |
|---|---|---|
| Average (correction factor) | $S = nK\bar Y^2$ | $1$ |
| Between blocks | $S_B = K\sum_i (\bar Y_i - \bar Y)^2$ | $n - 1$ |
| Between treatments | $S_T = n\sum_k (\bar Y_k - \bar Y)^2$ | $K - 1$ |
| Residuals | $S_R = \sum_k\sum_i (Y_{ki} - \bar Y_i - \bar Y_k + \bar Y)^2$ | $(n-1)(K-1)$ |
| Total | $S_D = \sum_k\sum_i (Y_{ki} - \bar Y)^2$ | $N - 1 = nK - 1$ |

Applying decomposition of spread to the model given by (7.20), we arrive 
at Table 7.2. Based on this, the following corollary summarizes the inference 
techniques under the model defined by (7.20). 


Corollary 7.5.2. Under the model defined by (7.20), it holds:

(i) The MLEs for the unknown model parameters are given by

\[ \tilde\eta = \bar Y, \qquad \tilde\beta_i = \bar Y_i - \bar Y, \qquad \tilde\tau_k = \bar Y_k - \bar Y . \]

(ii) The $F$-statistic for testing the hypothesis $H_B$ of no block effects is given by

\[ F_B = \frac{S_B/(n-1)}{S_R/[(n-1)(K-1)]} . \]

Under $H_B$, the distribution of $F_B$ is Fisher's $F$-distribution with $(n-1)$ and
$(n-1)(K-1)$ degrees of freedom.

(iii) The $F$-statistic for testing the null hypothesis $H_T$ of no treatment effects is
given by

\[ F_T = \frac{S_T/(K-1)}{S_R/[(n-1)(K-1)]} . \]

Under $H_T$, the distribution of $F_T$ is Fisher's $F$-distribution with $(K-1)$
and $(n-1)(K-1)$ degrees of freedom.
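The entries of Table 7.2 and the two statistics of Corollary 7.5.2 can be computed from a $K \times n$ table of responses. The sketch below is an illustration only: the function name, the dictionary keys and the data are hypothetical, and `table[k][i]` is assumed to hold the response of treatment $k$ in block $i$.

```python
def randomized_block_anova(table):
    """Sums of squares and F statistics for a K x n randomized block table."""
    k_treat, n_blocks = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (k_treat * n_blocks)
    treat_means = [sum(row) / n_blocks for row in table]
    block_means = [sum(table[k][i] for k in range(k_treat)) / k_treat
                   for i in range(n_blocks)]

    s_b = k_treat * sum((m - grand) ** 2 for m in block_means)
    s_t = n_blocks * sum((m - grand) ** 2 for m in treat_means)
    s_r = sum((table[k][i] - block_means[i] - treat_means[k] + grand) ** 2
              for k in range(k_treat) for i in range(n_blocks))
    s_d = sum((y - grand) ** 2 for row in table for y in row)

    df_r = (n_blocks - 1) * (k_treat - 1)
    f_b = (s_b / (n_blocks - 1)) / (s_r / df_r)   # test of no block effects
    f_t = (s_t / (k_treat - 1)) / (s_r / df_r)    # test of no treatment effects
    return {"S_B": s_b, "S_T": s_t, "S_R": s_r, "S_D": s_d,
            "F_B": f_b, "F_T": f_t}


# Hypothetical data: K = 3 treatments (rows) in n = 4 blocks (columns).
blocks = randomized_block_anova([[10, 12, 11, 13],
                                 [14, 15, 13, 17],
                                 [9, 11, 10, 11]])
print(blocks)
```

The output satisfies $S_D = S_B + S_T + S_R$ up to rounding, which is the decomposition of spread behind Table 7.2. The values `F_B` and `F_T` are then compared with the corresponding quantiles of the $F$-distribution.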
7.6 Historical Remarks and Further Reading


Hotelling (1931) derived a deterministic transformation of Fisher’s F -distribution 
and demonstrated its usage in the context of testing for differences among several 
Gaussian means with a likelihood ratio test. The general idea of the Wald test goes 
back to Wald (1943). 

A classical textbook on the analysis of variance is that of Scheffé (1959). The 
general theory of testing linear hypotheses in linear models is described, e.g., in the 
textbook by Searle (1971). 


Chapter 8 
Some Other Testing Methods 


This chapter discusses some nonparametric testing methods. First, we treat classical
testing procedures such as the Kolmogorov–Smirnov and the Cramér–Smirnov–von
Mises test as particular cases of the substitution approach. Then, we are concerned
with Bayesian approaches towards hypothesis testing. Finally, Sect. 8.4 deals with
locally best tests. It is demonstrated that the score function is the natural equivalent
to the LR statistic if no uniformly best tests exist, but locally best tests are aimed
at, assuming that the model is differentiable in the mean. Conditioning on the ranks
of the observations leads to the theory of rank tests. Due to the close connection of
rank tests and permutation tests (the null distribution of a rank test is a permutation
distribution), we end the chapter with some general remarks on permutation tests.

Let $Y = (Y_1,\ldots,Y_n)^\top$ be an i.i.d. sample from a distribution $P$. The joint
distribution $\mathbb{P}$ of $Y$ is the $n$-fold product of $P$, so a hypothesis about $\mathbb{P}$ can be
formulated as a hypothesis about the marginal measure $P$. A simple hypothesis
$H_0$ means the assumption that $P = P_0$ for a given measure $P_0$. The empirical
measure $P_n$ is a natural empirical counterpart of $P$, leading to the idea of testing
the hypothesis by checking whether $P_n$ significantly deviates from $P_0$. As in the
estimation problem, this substitution idea can be realized in several different ways.
We briefly discuss below the method of moments and the minimal distance method.


8.1 Method of Moments for an i.i.d. Sample 


Let $g(\cdot)$ be any $d$-vector function on $\mathbb{R}^1$. The assumption $P = P_0$ leads to the
population moment

\[ m_0 = E_0\, g(Y_1). \]

The empirical counterpart of this quantity is given by

\[ M_n = E_n\, g(Y) = \frac{1}{n}\sum_{i} g(Y_i). \]

The method of moments (MOM) suggests considering the difference $M_n - m_0$
for building a reasonable test. The properties of $M_n$ were stated in Sect. 2.4. In
particular, under the null $P = P_0$, the first two moments of the vector $M_n - m_0$
can be easily computed: $E_0(M_n - m_0) = 0$ and

\[ \operatorname{Var}_0(M_n) = E_0\bigl[(M_n - m_0)(M_n - m_0)^\top\bigr] = n^{-1}V, \]

\[ V \stackrel{\mathrm{def}}{=} E_0\bigl[(g(Y) - m_0)(g(Y) - m_0)^\top\bigr]. \]

For simplicity of presentation we assume that the moment function $g$ is selected to
ensure a non-degenerate matrix $V$. Standardization by the covariance matrix leads
to the vector

\[ \xi_n = n^{1/2}\,V^{-1/2}(M_n - m_0), \]

which has under the null measure zero mean and a unit covariance matrix. Moreover,
$\xi_n$ is, under the null hypothesis, asymptotically standard normal, i.e., its distribution
is approximately standard normal if the sample size $n$ is sufficiently large; see
Theorem 2.4.4. The MOM test rejects the null hypothesis if the vector $\xi_n$ computed
from the available data $Y$ is very unlikely to be standard normal, that is, if it deviates
significantly from zero. We specify the procedure separately for the univariate and
multivariate cases.


8.1.1 Univariate Case 


Let $g(\cdot)$ be a univariate function with $E_0\, g(Y) = m_0$ and $E_0[g(Y) - m_0]^2 = \sigma^2$.
Define the linear test statistic

\[ T_n = \frac{1}{\sqrt{n\sigma^2}}\sum_{i}\,[g(Y_i) - m_0] = n^{1/2}\sigma^{-1}(M_n - m_0), \]

leading to the test

\[ \phi = 1(|T_n| > z_{\alpha/2}), \tag{8.1} \]

where $z_\alpha$ denotes the upper $\alpha$-quantile of the standard normal law.

Theorem 8.1.1. Let $Y$ be an i.i.d. sample from $P$. Then the test statistic $T_n$ is
asymptotically standard normal under the null and the test $\phi$ from (8.1) for
$H_0\colon P = P_0$ is of asymptotic level $\alpha$, that is,

\[ P_0(\phi = 1) \to \alpha, \qquad n \to \infty . \]

Similarly one can consider a one-sided alternative $H_1^+\colon m > m_0$ or $H_1^-\colon m < m_0$
about the moment $m = E g(Y)$ of the distribution $P$ and the corresponding
one-sided tests

\[ \phi^+ = 1(T_n > z_\alpha), \qquad \phi^- = 1(T_n < -z_\alpha). \]

As in Theorem 8.1.1, both tests $\phi^+$ and $\phi^-$ are of asymptotic level $\alpha$.
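A minimal sketch of the two-sided test (8.1) follows. It rests on illustrative assumptions that are not part of the text: the null measure $P_0$ is taken to be the uniform law on $[0,1]$ with $g(y) = y$, so that $m_0 = 1/2$ and $\sigma^2 = 1/12$, and the two deterministic samples are constructed only to show one non-rejection and one rejection. The normal quantile comes from `statistics.NormalDist` in the Python standard library.

```python
import math
from statistics import NormalDist


def mom_test(sample, g, m0, sigma, alpha=0.05):
    """Return (T_n, reject) for the two-sided univariate MOM test."""
    n = len(sample)
    m_n = sum(g(y) for y in sample) / n             # empirical moment M_n
    t_n = math.sqrt(n) * (m_n - m0) / sigma         # standardized statistic
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)     # upper alpha/2 quantile
    return t_n, abs(t_n) > z


n = 50
sigma0 = math.sqrt(1.0 / 12.0)                      # std of U[0, 1]

# Sample spread evenly over [0, 1]: its mean matches m0 = 0.5.
even_sample = [(i + 0.5) / n for i in range(n)]
t_even, reject_even = mom_test(even_sample, lambda y: y, 0.5, sigma0)

# Sample concentrated on [0.5, 0.9]: its mean 0.7 is far from m0.
shifted_sample = [0.5 + 0.4 * (i + 0.5) / n for i in range(n)]
t_shift, reject_shift = mom_test(shifted_sample, lambda y: y, 0.5, sigma0)

print(f"even: T_n = {t_even:.3f}, reject = {reject_even}")
print(f"shifted: T_n = {t_shift:.3f}, reject = {reject_shift}")
```

The one-sided tests $\phi^+$ and $\phi^-$ are obtained in the same way by comparing $T_n$ with $z_\alpha$ or $-z_\alpha$ instead of comparing $|T_n|$ with $z_{\alpha/2}$.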


8.1.2 Multivariate Case 


The components of the vector function $g(\cdot) \in \mathbb{R}^d$ are usually associated with
"directions" in which the null hypothesis is tested. The multivariate situation means
that we test simultaneously in $d > 1$ directions. The most natural test statistic is
the squared Euclidean norm of the standardized vector $\xi_n$:

\[ T_n \stackrel{\mathrm{def}}{=} \|\xi_n\|^2 = n\,\|V^{-1/2}(M_n - m_0)\|^2 . \tag{8.2} \]

By Theorem 2.4.4 the vector $\xi_n$ is asymptotically standard normal so that $T_n$
is asymptotically chi-squared with $d$ degrees of freedom. This yields the natural
definition of the test $\phi$ using quantiles of $\chi^2_d$,

\[ \phi = 1(T_n > \mathfrak{z}_\alpha) \tag{8.3} \]

with $\mathfrak{z}_\alpha$ denoting the upper $\alpha$-quantile of the $\chi^2_d$ distribution.


Theorem 8.1.2. Let $Y$ be an i.i.d. sample from $P$. If $\mathfrak{z}_\alpha$ fulfills $P(\chi^2_d > \mathfrak{z}_\alpha) = \alpha$,
then the test statistic $T_n$ from (8.2) is asymptotically $\chi^2_d$-distributed under the null
and the test $\phi$ from (8.3) for $H_0\colon P = P_0$ is of asymptotic level $\alpha$.


8.1.3 Series Expansion 


A standard method of building the moment tests or, alternatively, of choosing the
directions $g(\cdot)$ is based on some series expansion. Let $\psi_1, \psi_2, \ldots$ be a given set
of basis functions in the related functional space. It is especially useful to select
these basis functions to be orthonormal under the measure $P_0$:

\[ \int \psi_j(y)\,P_0(dy) = 0, \qquad \int \psi_j(y)\,\psi_{j'}(y)\,P_0(dy) = \delta_{j,j'}, \qquad \forall j, j'. \tag{8.4} \]

Select a fixed index $d$ and take the first $d$ basis functions $\psi_1,\ldots,\psi_d$ as
"directions" or components of $g$. Then

\[ m_{j,0} \stackrel{\mathrm{def}}{=} \int \psi_j(y)\,P_0(dy) = 0 \]

is the $j$-th population moment under the null hypothesis $H_0$ and it is tested by
checking whether the empirical moments $M_{j,n}$ with

\[ M_{j,n} \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i}\psi_j(Y_i) \]

do not deviate significantly from zero. The condition (8.4) effectively permits testing
each direction $\psi_j$ independently of the others.
For each $d$ one obtains a test statistic $T_{n,d}$ with

\[ T_{n,d} \stackrel{\mathrm{def}}{=} n\,\bigl(M_{1,n}^2 + \ldots + M_{d,n}^2\bigr), \]

leading to the test

\[ \phi_d = 1(T_{n,d} > \mathfrak{z}_{\alpha,d}), \]

where $\mathfrak{z}_{\alpha,d}$ is the upper $\alpha$-quantile of $\chi^2_d$. In practical applications the choice of $d$
is particularly relevant and is the subject of various studies.
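The construction can be sketched for one concrete choice, which is an assumption made for illustration and not prescribed by the text: $P_0$ is the uniform law on $[0,1]$, and the first $d = 2$ directions are the shifted Legendre polynomials $\psi_1(y) = \sqrt{3}\,(2y - 1)$ and $\psi_2(y) = \sqrt{5}\,(6y^2 - 6y + 1)$, which satisfy (8.4) under this $P_0$. For $d = 2$ the upper $\alpha$-quantile of $\chi^2_2$ has the closed form $-2\log\alpha$, so no statistical library is needed.

```python
import math


def psi_1(y):
    """First shifted Legendre direction, orthonormal under U[0, 1]."""
    return math.sqrt(3.0) * (2.0 * y - 1.0)


def psi_2(y):
    """Second shifted Legendre direction, orthonormal under U[0, 1]."""
    return math.sqrt(5.0) * (6.0 * y * y - 6.0 * y + 1.0)


def series_test(sample, alpha=0.05):
    """Return (T_{n,2}, reject) for the series-expansion test with d = 2."""
    n = len(sample)
    moments = [sum(psi(y) for y in sample) / n for psi in (psi_1, psi_2)]
    t_nd = n * sum(m * m for m in moments)
    critical = -2.0 * math.log(alpha)     # upper alpha-quantile of chi^2_2
    return t_nd, t_nd > critical


n = 100
# Sample spread evenly over [0, 1]: both empirical moments are nearly zero.
even_sample = [(i + 0.5) / n for i in range(n)]
t_even, reject_even = series_test(even_sample)

# Sample concentrated on [0.45, 0.55]: correct mean, but far too little spread.
narrow_sample = [0.5 + 0.1 * ((i + 0.5) / n - 0.5) for i in range(n)]
t_narrow, reject_narrow = series_test(narrow_sample)

print(f"even: T = {t_even:.6f}, reject = {reject_even}")
print(f"narrow: T = {t_narrow:.2f}, reject = {reject_narrow}")
```

The second sample has the correct mean, so the direction $\psi_1$ alone would not detect it; the rejection comes from $\psi_2$. This illustrates why the choice of $d$ matters.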


8.1.4 Testing a Parametric Hypothesis 


The method of moments can be extended to the situation when the null hypothesis is
parametric: $H_0\colon P \in (P_\theta, \theta \in \Theta_0)$. It is natural to apply the method of moments
both to estimate the parameter $\theta$ under the null and to test the null. So, we assume
two different moment vector functions $g_0$ and $g_1$ to be given. The first one is
selected to fulfill

\[ \theta = E_\theta\, g_0(Y_1), \qquad \theta \in \Theta_0 . \]

This permits estimating the parameter $\theta$ directly by the empirical moment:

\[ \tilde\theta = \frac{1}{n}\sum_{i} g_0(Y_i). \]

The second vector of moment functions is composed of directional alternatives.
An identifiability condition suggests selecting the directional alternative functions
orthogonal to $g_0$ in the following sense. We choose $g_1 = (g_1^{(1)},\ldots,g_1^{(k)})^\top\colon \mathbb{R}^r \to \mathbb{R}^k$
such that for all $\theta \in \Theta_0$ it holds $g_1(m_1,\ldots,m_r) = 0 \in \mathbb{R}^k$, where
$(m_\ell\colon 1 \le \ell \le r)$ denote the first $r$ (population) moments of the distribution $P_\theta$.


Theorem 8.1.3. Let $\tilde m_\ell = n^{-1}\sum_{i=1}^{n} Y_i^\ell$ denote the $\ell$-th sample moment for
$1 \le \ell \le r$. Then, under regularity assumptions discussed in Sect. 2.4 and assuming
that each $g_1^{(j)}$ is continuously differentiable and that all $(m_\ell\colon 1 \le \ell \le 2r)$ are
continuous functions of $\theta$, it holds that the distribution of

\[ T_n \stackrel{\mathrm{def}}{=} n\, g_1^\top(\tilde m_1,\ldots,\tilde m_r)\,V^{-1}(\tilde\theta)\,g_1(\tilde m_1,\ldots,\tilde m_r) \]

converges under $H_0$ weakly to $\chi^2_k$, where

\[ V(\theta) \stackrel{\mathrm{def}}{=} J(g_1)\,\Sigma\,J(g_1)^\top \in \mathbb{R}^{k\times k}, \]

\[ J(g_1) \stackrel{\mathrm{def}}{=} \Bigl(\frac{\partial g_1^{(j)}(m_1,\ldots,m_r)}{\partial m_\ell}\Bigr)_{1\le j\le k,\ 1\le \ell\le r} \in \mathbb{R}^{k\times r}, \]

and $\Sigma = (\sigma_{ij}) \in \mathbb{R}^{r\times r}$ with $\sigma_{ij} = m_{i+j} - m_i m_j$.


Theorem 8.1.3, which is an application of the Delta method in connection with
the asymptotic normality of MOM estimators, leads to the goodness-of-fit test

\[ \phi = 1(T_n > \mathfrak{z}_\alpha), \]

where $\mathfrak{z}_\alpha$ is the upper $\alpha$-quantile of $\chi^2_k$, for testing $H_0$.


8.2 Minimum Distance Method for an i.i.d. Sample 


The method of moments is especially useful for the case of a simple hypothesis 
because it compares the population moments computed under the null with their 
empirical counterpart. However, if a more complicated composite hypothesis is 
tested, the population moments cannot be computed directly: the null measure is 
not specified precisely. In this case, the minimum distance idea appears to be useful. 
Let (Po,9 € © C R?”) bea parametric family and ©o be a subset of ©. The null 
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hypothesis about an iid. sample Y from P is that P € (P9,0 € ©o). Let 
p(P, P’) denote some functional (distance) defined for measures P, P’ on the real 
line. We assume that p satisfies the following conditions: o(P»9,, Pe,) = 0 and 
p(Pe,, Po,) = 0 iff 6; = @2. The condition P € (P9,@ € Oo) can be rewritten 
in the form 


inf p(P, Py) = 0. 
jones 6) 


Now we can apply the substitution principle: use P, in place of P. Define the 
value T by 


T © inf p(P,, Po). (8.5) 
0EQo 


Large values of the test statistic T indicate a possible violation of the null 
hypothesis. 

In particular, if Ho is a simple hypothesis, that is, if the set po consists of one 
point 09, the test statistic reads as T = p(P,,, Po,) . The critical value for this test 
is usually selected by the level condition: 


Po, (p(Pn, Po.) > tw) <a. 


Note that the test statistic (8.5) can be viewed as a combination of two different
steps. First we estimate under the null the parameter $\theta \in \Theta_0$ which provides the
best possible parametric fit under the assumption $P \in (P_\theta, \theta \in \Theta_0)$:

\[
\tilde\theta_0 = \operatorname*{arginf}_{\theta \in \Theta_0} \rho(P_n, P_\theta).
\]

Next we formally apply the minimum distance test with the simple hypothesis given
by $\theta_0 = \tilde\theta_0$.

Below we discuss some standard choices of the distance $\rho$.


8.2.1 Kolmogorov–Smirnov Test


Let $P_0, P_1$ be two distributions on the real line with distribution functions $F_0, F_1$:
$F_j(y) = P_j(Y \le y)$ for $j = 0, 1$. Define

\[
\rho(P_0, P_1) = \rho(F_0, F_1) = \sup_y |F_0(y) - F_1(y)|. \tag{8.6}
\]

Now consider the related test starting from the case of a simple null hypothesis
$P = P_0$ with corresponding c.d.f. $F_0$. Then the distance $\rho$ from (8.6) (properly
scaled) leads to the Kolmogorov–Smirnov test statistic

\[
T_n \stackrel{\mathrm{def}}{=} \sup_y n^{1/2} |F_0(y) - F_n(y)|.
\]


A nice feature of this test is the property of asymptotic pivotality. 


Theorem 8.2.1 (Kolmogorov). Let $F_0$ be a continuous c.d.f. Then

\[
T_n = \sup_y n^{1/2} |F_0(y) - F_n(y)| \xrightarrow{D} \eta,
\]

where $\eta$ is a fixed random variable (maximum of the absolute value of a Brownian bridge on $[0, 1]$).


Proof. Idea of the proof: The c.d.f. $F_0$ is monotonic and continuous. Therefore, its
inverse function $F_0^{-1}$ is uniquely defined. Consider the r.v.'s

\[
U_i = F_0(Y_i).
\]

The basic fact about this transformation is that the $U_i$'s are i.i.d. uniform on the
interval $[0, 1]$.

Lemma 8.2.1. The r.v.'s $U_i$ are i.i.d. with values in $[0, 1]$ and for any $u \in [0, 1]$ it
holds

\[
P(U_i \le u) = u.
\]

By definition of $F_0^{-1}$, it holds for any $u \in [0, 1]$

\[
F_0\bigl(F_0^{-1}(u)\bigr) = u.
\]

Moreover, if $G_n$ is the empirical c.d.f. of the $U_i$'s, that is, if

\[
G_n(u) \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n \mathbf{1}(U_i \le u),
\]

then

\[
G_n(u) = F_n\bigl[F_0^{-1}(u)\bigr]. \tag{8.7}
\]

Exercise 8.2.1. Check Lemma 8.2.1 and (8.7).

Now by the change of variable $y = F_0^{-1}(u)$ we obtain

\[
T_n = \sup_{u \in [0,1]} n^{1/2} \bigl| F_0(F_0^{-1}(u)) - F_n(F_0^{-1}(u)) \bigr| = \sup_{u \in [0,1]} n^{1/2} |u - G_n(u)|.
\]


It is obvious that the right-hand side of this expression does not depend on the
original model. Actually, it is for fixed $n$ a precisely described random variable, and
so its distribution only depends on $n$. It only remains to show that this distribution
for large $n$ is close to some fixed limit distribution with a continuous c.d.f. allowing
for a choice of a proper critical value. We indicate the main steps of the proof.
Given a sample $U_1, \ldots, U_n$, define the random function

\[
\xi_n(u) \stackrel{\mathrm{def}}{=} n^{1/2} [u - G_n(u)].
\]

Clearly $T_n = \sup_{u \in [0,1]} |\xi_n(u)|$. Next, convergence of the random functions $\xi_n(\cdot)$
would imply the convergence of their maximum over $u \in [0, 1]$, because the
maximum is a continuous functional of a function. Finally, the weak convergence
of $\xi_n(\cdot) \to \xi(\cdot)$ can be checked if for any continuous function $h(u)$, it holds

\[
\langle \xi_n, h \rangle \stackrel{\mathrm{def}}{=} n^{1/2} \int_0^1 h(u) [u - G_n(u)] \, du
\xrightarrow{w} \langle \xi, h \rangle \stackrel{\mathrm{def}}{=} \int_0^1 h(u) \xi(u) \, du.
\]


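Now the result can be derived from the representation

\[
\langle \xi_n, h \rangle = n^{1/2} \Bigl[ m(h) - \int_0^1 h(u) G_n(u) \, du \Bigr] = n^{-1/2} \sum_{i=1}^n \bigl[ m(h) - H(U_i) \bigr]
\]

with $H(t) \stackrel{\mathrm{def}}{=} \int_t^1 h(u) \, du$ and $m(h) \stackrel{\mathrm{def}}{=} \int_0^1 u \, h(u) \, du = E H(U_i)$, and from the central limit theorem for a sum of i.i.d.
random variables.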
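The reduction to uniform variables can be checked numerically. The following Python sketch is an illustration added here and is not taken from the original text; the helper name `ks_statistic` and the two chosen distributions are assumptions made for the example. One set of uniforms is pushed through two different quantile functions, and $T_n$ is computed in each model with the corresponding $F_0$.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Return n^{1/2} * sup_y |F_0(y) - F_n(y)| for a continuous c.d.f. F_0."""
    n = len(sample)
    u = sorted(cdf(y) for y in sample)  # sorted values of U_i = F_0(Y_i)
    # F_n jumps at the data points, so the supremum is attained at a jump,
    # either just before it or at it.
    d = max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))
    return math.sqrt(n) * d


rng = random.Random(0)
uniforms = [rng.random() for _ in range(200)]

# The same uniforms pushed through two different quantile functions F_0^{-1}.
exp_sample = [-math.log(1.0 - u) for u in uniforms]            # Exp(1)
logistic_sample = [math.log(u / (1.0 - u)) for u in uniforms]  # standard logistic

t_uniform = ks_statistic(uniforms, lambda u: u)
t_exp = ks_statistic(exp_sample, lambda y: 1.0 - math.exp(-y))
t_logistic = ks_statistic(logistic_sample, lambda y: 1.0 / (1.0 + math.exp(-y)))
print(t_uniform, t_exp, t_logistic)  # the three values agree up to rounding
```

The three statistics coincide because each of them equals $\sup_u n^{1/2} |u - G_n(u)|$ for the same uniforms, which is exactly the pivotality argument of the proof.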


8.2.1.1 The Case of a Composite Hypothesis 


If $H_0 : P \in (P_\theta, \theta \in \Theta_0)$ is considered, then the test statistic is described by
(8.5). As we already mentioned, testing of a composite hypothesis can be viewed as
a two-step procedure. In the first step, $\theta$ is estimated by $\tilde\theta_0$ and in the second step,
the goodness-of-fit test based on $T_n$ is carried out, where $F_0$ is replaced by the
c.d.f. corresponding to $P_{\tilde\theta_0}$. It turns out that pivotality of the distribution of $T_n$ is
preserved if $\theta$ is a location and/or scale parameter, but a general (asymptotic) theory
allowing to derive $t_\alpha$ analytically is not available. Therefore, computer simulations
are typically employed to approximate $t_\alpha$.
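A minimal sketch of such a simulation for the normal location-scale family is given below. It is an illustration under stated assumptions and not the procedure of the text: the parameter is estimated by the sample mean and sample standard deviation instead of the minimum distance estimate, and the function names are hypothetical.

```python
import math
import random
import statistics


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_fitted_normal(sample):
    """KS statistic with F_0 replaced by the fitted normal c.d.f.

    Assumption: location and scale are estimated by sample mean and
    sample standard deviation (not by the minimum distance estimate).
    """
    n = len(sample)
    mu = statistics.fmean(sample)
    sigma = statistics.stdev(sample)
    u = sorted(normal_cdf((y - mu) / sigma) for y in sample)
    d = max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))
    return math.sqrt(n) * d


def simulated_critical_value(n, alpha, n_sim, rng):
    """Empirical (1 - alpha)-quantile of the statistic under the null.

    Because of location-scale invariance it suffices to simulate N(0, 1).
    """
    sims = sorted(
        ks_fitted_normal([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    )
    return sims[math.ceil((1.0 - alpha) * n_sim) - 1]


rng = random.Random(1)
t_alpha = simulated_critical_value(n=20, alpha=0.05, n_sim=2000, rng=rng)
print(t_alpha)
```

The simulated value is noticeably smaller than the critical value for a simple hypothesis at the same level, since the fitted c.d.f. adapts to the sample.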


8.2.2 $\omega^2$ Test (Cramér–Smirnov–von Mises)


Here we briefly discuss another distance also based on the c.d.f. of the null measure.
Namely, define for a measure $P$ on the real line with c.d.f. $F$

\[
\rho(P_n, P) = \rho(F_n, F) = n \int [F_n(y) - F(y)]^2 \, dF(y). \tag{8.8}
\]

For the case of a simple hypothesis $P = P_0$, the Cramér–Smirnov–von Mises
(CSvM) test statistic is given by (8.8) with $F = F_0$. This is another functional of
the path of the random function $n^{1/2} [F_n(y) - F_0(y)]$. The Kolmogorov test uses
the maximum of this function while the CSvM test uses the integral of this function
squared. The property of pivotality is preserved for the CSvM test statistic as well.


Theorem 8.2.2. Let $F_0$ be a continuous c.d.f. Then

\[
T_n = n \int [F_n(y) - F_0(y)]^2 \, dF_0(y) \xrightarrow{D} \eta,
\]

where $\eta$ is a fixed random variable (integral of a Brownian bridge squared on
$[0, 1]$).


Proof. The idea of the proof is the same as in the case of the Kolmogorov–Smirnov
test. First the transformation by $F_0^{-1}$ translates the general case to the case of the
uniform distribution on $[0, 1]$. Next one can again use the functional convergence
of the process $\xi_n(u)$.
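For computations, the integral in (8.8) with $F = F_0$ can be evaluated in closed form. The sketch below is an added illustration; it uses the standard computing formula $1/(12n) + \sum_{i=1}^n \bigl(U_{i:n} - (2i-1)/(2n)\bigr)^2$ with $U_i = F_0(Y_i)$, which is not stated in the text, and compares it with a direct numerical integration on the uniform scale. The function names are hypothetical.

```python
import bisect


def cvm_statistic(sample, cdf):
    """n * int (F_n - F_0)^2 dF_0, evaluated by the closed-form formula."""
    n = len(sample)
    u = sorted(cdf(y) for y in sample)
    return 1.0 / (12 * n) + sum(
        (ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u)
    )


def cvm_by_integration(sample, cdf, grid=100_000):
    """Midpoint-rule approximation of n * int_0^1 (G_n(u) - u)^2 du."""
    n = len(sample)
    u = sorted(cdf(y) for y in sample)
    h = 1.0 / grid
    total = 0.0
    for k in range(grid):
        t = (k + 0.5) * h
        g = bisect.bisect_right(u, t) / n  # empirical c.d.f. G_n(t)
        total += (g - t) ** 2
    return n * total * h


sample = [0.11, 0.52, 0.48, 0.93, 0.27]
exact = cvm_statistic(sample, lambda u: u)
approx = cvm_by_integration(sample, lambda u: u)
print(exact, approx)
```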


8.3 Partially Bayes Tests and Bayes Testing 


In the above sections we mostly focused on the likelihood ratio testing approach. 
As in estimation theory, the LR approach is very general and possesses some 
nice properties. This section briefly discusses some possible alternative approaches 
including partially Bayes and Bayes approaches. 


8.3.1 Partial Bayes Approach and Bayes Tests 


Let $\Theta_0$ and $\Theta_1$ be two subsets of the parameter set $\Theta$. We test the null hypothesis
$H_0 : \theta^* \in \Theta_0$ against the alternative $H_1 : \theta^* \in \Theta_1$. The LR approach compares
the maximum of the likelihood process over $\Theta_0$ with the similar maximum over
$\Theta_1$. Let now two measures $\pi_0$ on $\Theta_0$ and $\pi_1$ on $\Theta_1$ be given. Now instead of the
maximum of $L(Y, \theta)$ we consider its weighted sum (integral) over $\Theta_0$ (resp. $\Theta_1$)
with weights $\pi_0(\theta)$ resp. $\pi_1(\theta)$. More precisely, we consider the value

\[
T_\pi = \int_{\Theta_1} L(Y, \theta) \pi_1(\theta) \lambda(d\theta) - \int_{\Theta_0} L(Y, \theta) \pi_0(\theta) \lambda(d\theta).
\]


Significantly positive values of this expression indicate that the null hypothesis is
likely to be false.

Similarly and more commonly used, we may define measures $g_0$ and $g_1$ such
that

\[
\pi(\theta) =
\begin{cases}
\pi_0 \, g_0(\theta), & \theta \in \Theta_0, \\
\pi_1 \, g_1(\theta), & \theta \in \Theta_1,
\end{cases}
\]

where $\pi$ is a prior on the entire parameter space $\Theta$ and $\pi_i \stackrel{\mathrm{def}}{=} \int_{\Theta_i} \pi(\theta) \lambda(d\theta) =
P(\Theta_i)$ for $i = 0, 1$. Then, the Bayes factor for comparing $H_0$ and $H_1$ is given by

\[
B_{0|1} = \frac{\int_{\Theta_0} L(Y, \theta) g_0(\theta) \lambda(d\theta)}{\int_{\Theta_1} L(Y, \theta) g_1(\theta) \lambda(d\theta)}
= \frac{P(\Theta_0 \mid Y) / P(\Theta_1 \mid Y)}{P(\Theta_0) / P(\Theta_1)}. \tag{8.9}
\]

The representation of the Bayes factor on the right-hand side of (8.9) shows that it
can be interpreted as the ratio of the posterior odds for $H_0$ and the prior odds for
$H_0$. The resulting test rejects the null hypothesis for significantly small values of
$B_{0|1}$, or, equivalently, for significantly large values of $B_{1|0} = 1 / B_{0|1}$. In the special
case that $H_0$ and $H_1$ are two simple hypotheses, i.e., $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$,
the Bayes factor is simply given by

\[
B_{0|1} = \frac{L(Y, \theta_0)}{L(Y, \theta_1)},
\]

hence, in such a case the testing approach based on the Bayes factor is equivalent to
the LR approach.
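As a small numerical illustration of (8.9), which is not part of the original text, consider $n$ Bernoulli observations with $s$ successes, $\Theta_0 = [0, 1/2]$, $\Theta_1 = (1/2, 1]$, and a uniform prior on $[0, 1]$, so that $g_0$ and $g_1$ are uniform densities on the two halves. These choices and the function name are assumptions made for the sketch; the integrals are approximated by the midpoint rule.

```python
def bayes_factor_01(s, n, grid=20_000):
    """B_{0|1} for Bernoulli data with Theta_0 = [0, 1/2], Theta_1 = (1/2, 1].

    Assumption: g_0 and g_1 are uniform densities (equal to 2) on the two
    halves of [0, 1], which corresponds to a uniform prior on [0, 1].
    """
    def likelihood(theta):
        return theta ** s * (1.0 - theta) ** (n - s)

    h = 0.5 / grid  # midpoint rule on each half of [0, 1]
    m0 = sum(likelihood((k + 0.5) * h) for k in range(grid)) * h * 2.0
    m1 = sum(likelihood(0.5 + (k + 0.5) * h) for k in range(grid)) * h * 2.0
    return m0 / m1


print(bayes_factor_01(5, 10))  # symmetric data: close to 1
print(bayes_factor_01(8, 10))  # many successes: evidence against H_0
```

Small values of the returned ratio speak against $H_0$, in line with the rejection rule described above.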


8.3.2 Bayes Approach 


Within the Bayes approach the true data distribution and the true parameter value 
are not defined. Instead one considers the prior and posterior distribution of the 
parameter. The parametric Bayes model can be represented as 


\[
Y \mid \theta \sim p(y \mid \theta), \qquad \theta \sim \pi(\theta).
\]

The posterior density $p(\theta \mid Y)$ can be computed via the Bayes formula:

\[
p(\theta \mid Y) = \frac{p(Y \mid \theta) \pi(\theta)}{p(Y)}
\]

with the marginal density $p(Y) = \int_\Theta p(Y \mid \theta) \pi(\theta) \lambda(d\theta)$. The Bayes approach
suggests instead of checking the hypothesis about the location of the parameter $\theta$
to look directly at the posterior distribution. Namely, one can construct the so-called
credible sets which contain a prespecified fraction, say $1 - \alpha$, of the mass of the
whole posterior distribution. Then one can say that the probability for the parameter
$\theta$ to lie outside of this credible set is at most $\alpha$. So, the testing problem in the
frequentist approach is replaced by the problem of confidence estimation for the
Bayes method.


Example 8.3.1 (Example 5.2.2 Continued). Consider again the situation of a
Bernoulli product likelihood for $Y = (Y_1, \ldots, Y_n)^\top$ with unknown success
probability $\theta$. In Example 5.2.2 we saw that this family of likelihoods is conjugate
to the family of beta distributions as priors on $[0, 1]$. More specifically, if
$\theta \sim \mathrm{Beta}(a, b)$, then $\theta \mid Y = y \sim \mathrm{Beta}(a + s, b + n - s)$, where $s = \sum_{i=1}^n y_i$
denotes the observed number of successes. Under quadratic risk, the Bayes-optimal
point estimate for $\theta$ is given by $E[\theta \mid Y = y] = (a + s)/(a + b + n)$, and a
credible interval can be constructed around this value by utilizing quantiles of the
posterior $\mathrm{Beta}(a + s, b + n - s)$ distribution.
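The conjugate update can be sketched in a few lines of Python. This is an added illustration: the data vector, the uniform prior $a = b = 1$, and the function name are assumptions, and the posterior quantiles are approximated by Monte Carlo draws rather than computed exactly.

```python
import random


def beta_posterior_summary(a, b, y, level=0.95, n_draws=100_000, seed=0):
    """Posterior mean and an equal-tailed credible interval for theta.

    The posterior is Beta(a + s, b + n - s); its quantiles are
    approximated here by sorted Monte Carlo draws (an approximation).
    """
    n, s = len(y), sum(y)
    a_post, b_post = a + s, b + n - s
    post_mean = a_post / (a_post + b_post)
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a_post, b_post) for _ in range(n_draws))
    tail = (1.0 - level) / 2.0
    lower = draws[int(tail * n_draws)]
    upper = draws[int((1.0 - tail) * n_draws) - 1]
    return post_mean, (lower, upper)


y = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # s = 7 successes in n = 10 trials
mean, (lo, hi) = beta_posterior_summary(a=1.0, b=1.0, y=y)
print(mean, lo, hi)
```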


8.4 Score, Rank, and Permutation Tests 


8.4.1 Score Tests 


Testing a composite null hypothesis $H_0$ against a composite alternative $H_1$ is in
general a challenging problem, because only in some special cases uniformly (over
$\theta \in H_1$) most powerful level $\alpha$-tests exist. In all other cases, one has to decide
against which regions in $H_1$ optimal power is targeted. One class of procedures is
given by locally best tests, optimizing power in regions close to $H_0$. To formalize
this class mathematically, one needs the concept of differentiability in the mean.


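Definition 8.4.1 (Differentiability in the Mean). Let $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}), (P_\theta)_{\theta \in \Theta})$ denote
a statistical model and assume that $(P_\theta)_{\theta \in \Theta}$ is a dominated (by $\mu$) family of
measures, where $\Theta \subseteq \mathbb{R}$. Then, $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}), (P_\theta)_{\theta \in \Theta})$ is called differentiable in the
mean in $\theta_0 \in \mathring{\Theta}$, if a function $g \in L_1(\mu)$ exists with

\[
\Bigl\| t^{-1} \Bigl( \frac{dP_{\theta_0 + t}}{d\mu} - \frac{dP_{\theta_0}}{d\mu} \Bigr) - g \Bigr\|_{L_1(\mu)} \to 0 \quad \text{as } t \to 0.
\]

The function $g$ is called $L_1(\mu)$-derivative of $\theta \mapsto P_\theta$ in $\theta_0$. In the sequel, we
choose w.l.o.g. $\theta_0 = 0$.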
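Theorem 8.4.1 (§18 in Hewitt and Stromberg (1975)). Under the assumptions of
Definition 8.4.1 let $\theta_0 = 0$ and let densities be given by $f_\theta(y) \stackrel{\mathrm{def}}{=} dP_\theta/(d\mu)(y)$.
Assume that there exists an open neighborhood $U$ of $0$ such that for $\mu$-almost all
$y$ the mapping $U \ni \theta \mapsto f_\theta(y)$ is absolutely continuous, i.e., there exists an integrable
function $\tau \mapsto \dot f(y, \tau)$ on $U$ with

\[
\int_{\theta_1}^{\theta_2} \dot f(y, \tau) \, d\tau = f_{\theta_2}(y) - f_{\theta_1}(y), \qquad \theta_1 < \theta_2,
\]

and assume that $\frac{\partial}{\partial\theta} f_\theta(y) \big|_{\theta = 0} = \dot f(y, 0)$ $\mu$-almost everywhere. Furthermore,
assume that for $\theta \in U$ the function $y \mapsto \dot f(y, \theta)$ is $\mu$-integrable with

\[
\int |\dot f(y, \theta)| \, d\mu(y) \to \int |\dot f(y, 0)| \, d\mu(y), \qquad \theta \to 0.
\]

Then, $\theta \mapsto P_\theta$ is differentiable in the mean in $0$ with $g = \dot f(\cdot, 0)$.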


Theorem 8.4.2. Under the assumptions of Definition 8.4.1 assume that the densities
$\theta \mapsto f_\theta$ are differentiable in the mean in $0$ with $L_1(\mu)$-derivative $g$. Then,

\[
\theta^{-1} \log\bigl( f_\theta(y) / f_0(y) \bigr) = \theta^{-1} \bigl( \log f_\theta(y) - \log f_0(y) \bigr)
\]

converges for $\theta \to 0$ to $L(y)$ (say) in $P_0$-probability. We call $L$ the derivative of
the (logarithmic) likelihood ratio or score function. It holds

\[
L(y) = g(y) / f_0(y), \qquad \int L \, dP_0 = 0, \qquad \{f_0 = 0\} \subseteq \{g = 0\} \quad P_0\text{-almost surely}.
\]

Proof. $\theta^{-1} (f_\theta / f_0 - 1) \to g / f_0$ converges in $L_1(P_0)$ and, consequently, in
$P_0$-probability. The chain rule yields $L(y) = g(y) / f_0(y)$. Noting that $\int (f_\theta -
f_0) \, d\mu = 0$ we conclude

\[
\int L \, dP_0 = \int g \, d\mu = 0.
\]


Example 8.4.1. (a) Location parameter model:
Let $Y = \theta + X$, $\theta \ge 0$, and assume that $X$ has a density $f$ which is absolutely
continuous with respect to the Lebesgue measure and does not depend on $\theta$.
Then, the densities $\theta \mapsto f(y - \theta)$ of $Y$ under $\theta$ are differentiable in the mean
in zero with score function $L(y) = -f'(y)/f(y)$ (differentiation with respect
to $y$).

(b) Scale parameter model:
Let $Y = \exp(\theta) X$ and assume again that $X$ has density $f$ with the properties
stated in part (a). Moreover, assume that $\int |x f'(x)| \, dx < \infty$. Then, the
densities $\theta \mapsto \exp(-\theta) f(y \exp(-\theta))$ of $Y$ under $\theta$ are differentiable in the
mean in zero with score function $L(y) = -(1 + y f'(y)/f(y))$.
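Both score functions can be checked numerically from the difference quotient $\theta^{-1} (\log f_\theta(y) - \log f_0(y))$ of Theorem 8.4.2. The sketch below is an added illustration; the logistic and exponential densities are chosen as examples, and the helper names are hypothetical. For the standard logistic density one has $-f'(y)/f(y) = \tanh(y/2)$, and for the exponential density $-(1 + y f'(y)/f(y)) = y - 1$ for $y > 0$.

```python
import math


def logistic_pdf(y):
    e = math.exp(-y)
    return e / (1.0 + e) ** 2


def exp_pdf(x):
    return math.exp(-x) if x >= 0 else 0.0


def score_location(f, y, theta=1e-6):
    """Difference quotient for the location family theta -> f(y - theta)."""
    return (math.log(f(y - theta)) - math.log(f(y))) / theta


def score_scale(f, y, theta=1e-6):
    """Difference quotient for the scale family theta -> e^{-theta} f(y e^{-theta})."""
    def density(t):
        return math.exp(-t) * f(y * math.exp(-t))
    return (math.log(density(theta)) - math.log(density(0.0))) / theta


print(score_location(logistic_pdf, 1.3), math.tanh(0.65))
print(score_scale(exp_pdf, 2.0), 2.0 - 1.0)
```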


Lemma 8.4.1. Assume that the family $\theta \mapsto P_\theta$ is differentiable in the mean with
score function $L$ in $\theta_0 = 0$ and that $c_i$, $1 \le i \le n$, are real constants. Then, also
$\theta \mapsto \bigotimes_{i=1}^n P_{c_i \theta}$ is differentiable in the mean in zero, and has score function

\[
(y_1, \ldots, y_n) \mapsto \sum_{i=1}^n c_i L(y_i).
\]


Fig. 8.1 Locally best test $\varphi^*$ with expectation $\alpha$ under $\theta_0$


Exercise 8.4.1. Prove Lemma 8.4.1. 


Definition 8.4.2 (Score Test). Let $\theta \mapsto P_\theta$ be differentiable in the mean in $\theta_0$
with score function $L$. Then, every test $\psi$ of the form

\[
\psi(y) =
\begin{cases}
1, & \text{if } L(y) > c, \\
\gamma, & \text{if } L(y) = c, \\
0, & \text{if } L(y) < c,
\end{cases}
\]

is called a score test. In this, $\gamma \in [0, 1]$ denotes a randomization constant.


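Definition 8.4.3 (Locally Best Test). Let $(P_\theta)_{\theta \in \Theta}$ with $\Theta \subseteq \mathbb{R}$ denote a family
which is differentiable in the mean in $\theta_0 \in \Theta$. A test $\varphi^*$ with $E_{\theta_0}[\varphi^*] = \alpha$ is
called locally best test among all tests with expectation $\alpha$ under $\theta_0$ for the test
problem $H_0 = \{\theta_0\}$ versus $H_1 = \{\theta > \theta_0\}$ if

\[
\frac{d}{d\theta} E_\theta[\varphi^*] \Big|_{\theta = \theta_0} \ge \frac{d}{d\theta} E_\theta[\varphi] \Big|_{\theta = \theta_0}
\]

for all tests $\varphi$ with $E_{\theta_0}[\varphi] = \alpha$.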


Figure 8.1 illustrates the situation considered in Definition 8.4.3 graphically.

Theorem 8.4.3. Under the assumptions of Definition 8.4.3, the score test

\[
\psi(y) =
\begin{cases}
1, & \text{if } L(y) > c(\alpha), \\
\gamma, & \text{if } L(y) = c(\alpha), \\
0, & \text{if } L(y) < c(\alpha),
\end{cases}
\qquad \gamma \in [0, 1],
\]

with $E_{\theta_0}[\psi] = \alpha$ is a locally best test for testing $H_0 = \{\theta_0\}$ against $H_1 =
\{\theta > \theta_0\}$.


Proof. We notice that for any test $\varphi$, it holds

\[
\frac{d}{d\theta} E_\theta[\varphi] \Big|_{\theta = \theta_0} = E_{\theta_0}[\varphi L].
\]

Hence, we have to optimize $\int \varphi(y) L(y) P_{\theta_0}(dy)$ with respect to $\varphi$ under the level
constraint, yielding the assertion in analogy to the argumentation in the proof of
Theorem 6.2.1.


Theorem 8.4.3 shows that in the theory of locally best tests the score function $L$
takes the role that the likelihood ratio has in the LR theory. Notice that, for an i.i.d.
sample $Y = (Y_1, \ldots, Y_n)^\top$, the joint product measure $(P_\theta)^{\otimes n}$ has score function
$(y_1, \ldots, y_n) \mapsto \sum_{i=1}^n L(y_i)$ according to Lemma 8.4.1, and Theorem 8.4.3 can be
applied to test $H_0 = \{0\}$ against $H_1 = \{\theta > 0\}$ based on $Y$.

Moreover, for $k$-sample problems with $k \ge 2$ groups and $n$ jointly independent
observations, Lemma 8.4.1 can be utilized to test the homogeneity hypothesis

\[
H_0 = \bigl\{ P^{Y_1} = P^{Y_2} = \ldots = P^{Y_n} : P^{Y_1} \text{ continuous} \bigr\}. \tag{8.10}
\]

To this end, one considers parametric families $\theta \mapsto P_{n,\theta}$ which belong to $H_0$ only
in case of $\theta = 0$, i.e., $P_{n,0} \in H_0$. For $\theta \ne 0$, $P_{n,\theta}$ is a product measure with
non-identical factors.


Example 8.4.2. (a) Regression model for a location parameter:
Let $Y_i = c_i \theta + X_i$, $1 \le i \le n$, where $\theta \ge 0$. In this, assume that the
$X_i$ are i.i.d. with Lebesgue density $f$ which is independent of $\theta$. Now, for a
two-sample problem with $n_1$ observations in the first group and $n_2 = n - n_1$
observations in the second group, we set $c_1 = c_2 = \cdots = c_{n_1} = 1$ and $c_i = 0$
for all $n_1 + 1 \le i \le n$. Under alternatives, the observations in the first group
are shifted by $\theta > 0$.

(b) Regression model for a scale parameter:
Let $c_i$, $1 \le i \le n$, denote real regression coefficients and consider the model
$Y_i = \exp(c_i \theta) X_i$, $1 \le i \le n$, $\theta \in \mathbb{R}$, where we assume again that the $X_i$ are
i.i.d. with Lebesgue density $f$ which is independent of $\theta$. Then, it holds

\[
\frac{dP_{n,\theta}}{d\lambda^n}(y) = \prod_{i=1}^n \exp(-c_i \theta) \, f\bigl( y_i \exp(-c_i \theta) \bigr).
\]

Under $\theta = 0$, the product measure $P_{n,0}$ belongs to $H_0$, while under
alternatives it does not.


8.4.2 Rank Tests


In this section, we will consider the case that only the ranks of the observations are
trustworthy (or available). Theorem 8.4.4 will be utilized to define resulting rank
tests based on parametric families $P_{n,\theta}$ as considered in Example 8.4.2. It turns out
that the score test based on ranks has a very simple structure.


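Theorem 8.4.4. Let $\theta \mapsto P_\theta$ denote a parametric family which is differentiable
in the mean in $\theta_0 = 0$ with respect to some reference measure $\mu$, $L_1(\mu)$-differentiable
for short, with score function $L$. Furthermore, let $S : \mathcal{Y} \to \mathcal{S}$ be
measurable. Then, $\theta \mapsto P_\theta^S$ is $L_1(\mu^S)$-differentiable with score function $s \mapsto
E_0[L \mid S = s]$.

Proof. First, we show that the $L_1(\mu^S)$-derivative of $\theta \mapsto P_\theta^S$ is given by $s \mapsto
E_\mu[g \mid S = s]$, where $g$ is the $L_1(\mu)$-derivative of $\theta \mapsto P_\theta$. To this end, notice
that

\[
\frac{dP_\theta^S}{d\mu^S}(s) = E_\mu[f_\theta \mid S = s], \qquad \text{where } f_\theta = \frac{dP_\theta}{d\mu}.
\]

Linearity of $E_\mu[\,\cdot \mid S]$ and transformation of measures leads to

\[
\int \Bigl| \theta^{-1} \Bigl( \frac{dP_\theta^S}{d\mu^S}(s) - \frac{dP_0^S}{d\mu^S}(s) \Bigr) - E_\mu[g \mid S = s] \Bigr| \, d\mu^S(s)
= \int \bigl| E_\mu\bigl[ \theta^{-1} (f_\theta - f_0) - g \mid S \bigr] \bigr| \, d\mu.
\]

Applying Jensen's inequality and Vitali's theorem, we conclude that $s \mapsto
E_\mu[g \mid S = s]$ is $L_1(\mu^S)$-derivative of $\theta \mapsto P_\theta^S$. Now, the chain rule yields
that the score function of $P_\theta^S$ in zero is given by

\[
s \mapsto E_\mu[g \mid S = s] \, \Bigl\{ E_\mu\Bigl[ \frac{dP_0}{d\mu} \Bigm| S = s \Bigr] \Bigr\}^{-1},
\]

and the assertion follows by substituting $g = L \, dP_0/(d\mu)$ and verifying that

\[
E_0[L \mid S] \; E_\mu\Bigl[ \frac{dP_0}{d\mu} \Bigm| S \Bigr] = E_\mu\Bigl[ L \, \frac{dP_0}{d\mu} \Bigm| S \Bigr] \quad \mu\text{-almost surely}.
\]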


For applying the latter theorem to rank statistics, we need to gather some basic 
facts about ranks and order statistics. 


Definition 8.4.4. Let $y = (y_1, \ldots, y_n)$ be a point in $\mathbb{R}^n$. Assume that the $y_i$ are
pairwise distinct and denote their ordered values by $y_{1:n} < y_{2:n} < \ldots < y_{n:n}$.

(a) For $1 \le i \le n$, the integer $r_i \equiv r_i(y) = \#\{ j \in \{1, \ldots, n\} : y_j \le y_i \}$ is called
the rank of $y_i$ (in $y$). The vector $r(y) = (r_1(y), \ldots, r_n(y)) \in S_n$ is called
rank vector of $y$.

(b) The inverse permutation $d(y) = (d_1(y), \ldots, d_n(y)) \stackrel{\mathrm{def}}{=} [r(y)]^{-1}$ is called the
vector of antiranks of $y$, and the integer $d_i(y)$ is called antirank of $i$ (the
index that corresponds to the $i$-th smallest observation in $y$).

Now, let $Y_1, \ldots, Y_n$ with $Y_i : \Omega_i \to \mathbb{R}$ be stochastically independent, continuously
distributed random variables and denote the joint distribution of $(Y_1, \ldots, Y_n)$ by
$P$.

(c) Because of $P\bigl( \bigcup_{i \ne j} \{Y_i = Y_j\} \bigr) = 0$ the following objects are $P$-almost surely
uniquely defined: $Y_{i:n}$ is called $i$-th order statistic of $Y = (Y_1, \ldots, Y_n)^\top$,
$R_i(Y) \stackrel{\mathrm{def}}{=} n F_n(Y_i) = r_i(Y_1, \ldots, Y_n)$ is called rank of $Y_i$, $R(Y) \stackrel{\mathrm{def}}{=}
(R_1(Y), \ldots, R_n(Y))^\top$ is called vector of rank statistics of $Y$, $D_i(Y) \stackrel{\mathrm{def}}{=}
d_i(Y_1, \ldots, Y_n)$ is called antirank of $i$ with respect to $Y$ and $D(Y) \stackrel{\mathrm{def}}{=} d(Y)$
is called vector of antiranks of $Y$.
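Ranks and antiranks are easy to compute directly from the definition. The following Python sketch is an added illustration with hypothetical function names; it assumes pairwise distinct observations and uses 1-based ranks as in Definition 8.4.4.

```python
def ranks(y):
    """r_i = number of observations y_j with y_j <= y_i (1-based ranks)."""
    return [sum(1 for yj in y if yj <= yi) for yi in y]


def antiranks(y):
    """d = r^{-1}: d_i is the index of the i-th smallest observation."""
    r = ranks(y)
    d = [0] * len(y)
    for i, ri in enumerate(r, start=1):
        d[ri - 1] = i
    return d


y = [2.5, 0.3, 1.7, 4.1]
r = ranks(y)      # [3, 1, 2, 4]
d = antiranks(y)  # [2, 3, 1, 4]
print(r, d)
```

For instance, $d_1 = 2$ because the smallest observation $0.3$ sits at index 2, and $y_{i:n} = y_{d_i}$ holds for every $i$.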
Lemma 8.4.2. Under the assumptions of Definition 8.4.4, it holds

(a) $i = r_{d_i} = d_{r_i}$, $y_i = y_{r_i:n}$, $y_{i:n} = y_{d_i}$.

(b) If $Y_1, \ldots, Y_n$ are exchangeable random variables, then $R(Y)$ is uniformly
distributed on $S_n$, i.e., $P(R(Y) = \sigma) = 1/n!$ for all permutations $\sigma =
(r_1, \ldots, r_n) \in S_n$.

(c) If $U_1, \ldots, U_n$ are i.i.d. with $U_1 \sim \mathrm{UNI}[0, 1]$, and $Y_i = F^{-1}(U_i)$, $1 \le i \le n$,
for some distribution function $F$, then it holds $Y_{i:n} = F^{-1}(U_{i:n})$. If $F$ is
continuous, then it holds $R(Y) = R(U_1, \ldots, U_n)$.

(d) If $(Y_1, \ldots, Y_n)$ are i.i.d. with c.d.f. $F$ of $Y_1$, then we have

(i) $P(Y_{i:n} \le y) = \sum_{j=i}^n \binom{n}{j} F(y)^j (1 - F(y))^{n-j}$.

(ii) $\frac{dP^{Y_{i:n}}}{dP^{Y_1}}(y) = n \binom{n-1}{i-1} F(y)^{i-1} (1 - F(y))^{n-i}$. If $P^{Y_1}$ has Lebesgue density
$f$, then $P^{Y_{i:n}}$ has Lebesgue density $f_{i:n}$, given by $f_{i:n}(y) =
n \binom{n-1}{i-1} F(y)^{i-1} (1 - F(y))^{n-i} f(y)$.

(iii) Letting $\mu = P^{Y_1}$, $(Y_{i:n})_{1 \le i \le n}$ has the joint $\mu^n$-density $(y_1, \ldots, y_n) \mapsto
n! \, \mathbf{1}_{\{y_1 < y_2 < \ldots < y_n\}}$. If $\mu$ has Lebesgue density $f$, then $(Y_{i:n})_{1 \le i \le n}$ has
$\lambda^n$-density $(y_1, \ldots, y_n) \mapsto n! \prod_{i=1}^n f(y_i) \, \mathbf{1}_{\{y_1 < y_2 < \ldots < y_n\}}$.

Remark 8.4.1. Lemma 8.4.2(c) (quantile transformation) shows the special importance
of the distribution of order statistics of i.i.d. $\mathrm{UNI}[0, 1]$-distributed random
variables $U_1, \ldots, U_n$. According to Lemma 8.4.2(d), the order statistic $U_{i:n}$ has
a $\mathrm{Beta}(i, n - i + 1)$ distribution with $E[U_{i:n}] = i/(n + 1)$ and $\mathrm{Var}(U_{i:n}) =
(i(n - i + 1))/((n + 1)^2 (n + 2))$. For computing the joint distribution function
of $(U_{1:n}, \ldots, U_{n:n})$, efficient recursive algorithms exist, for instance Bolshev's
recursion and Steck's recursion (see Shorack and Wellner (1986), p. 362 ff.).
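The moment formulas of Remark 8.4.1 can be checked by a short simulation. The sketch below is an added illustration; the sample size, the number of replications, and the function name are arbitrary choices made for the example.

```python
import random


def simulate_order_statistics(n, n_rep, rng):
    """Monte Carlo means and variances of U_{1:n}, ..., U_{n:n}."""
    sums = [0.0] * n
    squares = [0.0] * n
    for _ in range(n_rep):
        u = sorted(rng.random() for _ in range(n))
        for i, ui in enumerate(u):
            sums[i] += ui
            squares[i] += ui * ui
    means = [s / n_rep for s in sums]
    variances = [q / n_rep - m * m for q, m in zip(squares, means)]
    return means, variances


n = 5
means, variances = simulate_order_statistics(n, n_rep=20_000, rng=random.Random(0))
for i in range(1, n + 1):
    exact_mean = i / (n + 1)
    exact_var = i * (n - i + 1) / ((n + 1) ** 2 * (n + 2))
    print(i, means[i - 1], exact_mean, variances[i - 1], exact_var)
```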


Theorem 8.4.5. Let $Y = (Y_1, \ldots, Y_n)^\top$ be a vector of real-valued i.i.d. random
variables with continuous $\mu = P^{Y_1}$.

(a) The random vectors $R(Y)$ and $(Y_{i:n})_{1 \le i \le n}$ are stochastically independent.
(b) Let $T : \mathbb{R}^n \to \mathbb{R}$ denote a mapping such that the statistic $T(Y)$ is integrable.
For any $\sigma = (r_1, \ldots, r_n) \in S_n$, it holds

\[
E[T(Y) \mid R(Y) = \sigma] = E\bigl[ T\bigl( (Y_{r_i:n})_{1 \le i \le n} \bigr) \bigr].
\]


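Proof. For proving part (a), let $\sigma = (r_1, \ldots, r_n) \in S_n$ and Borel sets $A_i$, $1 \le i \le
n$, be arbitrary but fixed and define $(d_1, \ldots, d_n) \stackrel{\mathrm{def}}{=} \sigma^{-1}$. We note that $R(Y) = \sigma$
if and only if $Y_{d_1} < Y_{d_2} < \ldots < Y_{d_n}$ and that $Y_{d_i} = Y_{i:n} \in A_i$ if and only if
$Y_i \in A_{r_i}$.
Define $B \stackrel{\mathrm{def}}{=} \{ y \in \mathbb{R}^n : y_1 < y_2 < \ldots < y_n \}$. Then we obtain that

\[
\begin{aligned}
P\bigl( R(Y) = \sigma, \ \forall 1 \le i \le n : Y_{i:n} \in A_i \bigr)
&= P\bigl( \forall 1 \le i \le n : Y_{d_i} \in A_i, \ (Y_{d_i})_{1 \le i \le n} \in B \bigr) \\
&= \int_{\times_{i=1}^n A_{r_i}} \mathbf{1}_B(y_{d_1}, \ldots, y_{d_n}) \, d\mu^n(y_1, \ldots, y_n) \\
&= \int_{\times_{i=1}^n A_i} \mathbf{1}_B(y_1, \ldots, y_n) \, d\mu^n(y_1, \ldots, y_n),
\end{aligned}
\]

because $\mu^n$ is invariant under the transformation $(y_1, \ldots, y_n) \mapsto (y_{d_1}, \ldots, y_{d_n})$
due to exchangeability.
Summation over all $\sigma \in S_n$ yields

\[
P\bigl( \forall 1 \le i \le n : Y_{i:n} \in A_i \bigr) = n! \int_{\times_{i=1}^n A_i} \mathbf{1}_B(y_1, \ldots, y_n) \, d\mu^n(y_1, \ldots, y_n).
\]

Making use of Lemma 8.4.2(b), we conclude

\[
P\bigl( R(Y) = \sigma, \ \forall 1 \le i \le n : Y_{i:n} \in A_i \bigr)
= P\bigl( R(Y) = \sigma \bigr) \, P\bigl( \forall 1 \le i \le n : Y_{i:n} \in A_i \bigr),
\]

hence, the assertion of part (a). For proving part (b), we verify

\[
\begin{aligned}
E[T(Y) \mid R(Y) = \sigma]
&= \frac{\int_{\{R(Y) = \sigma\}} T(Y) \, dP}{P(R(Y) = \sigma)} \\
&= E\bigl[ T\bigl( (Y_{r_i:n})_{1 \le i \le n} \bigr) \mid R(Y) = \sigma \bigr] \\
&= E\bigl[ T\bigl( (Y_{r_i:n})_{1 \le i \le n} \bigr) \bigr],
\end{aligned}
\]

where we used that $Y = (Y_{r_i:n})_{i=1}^n$ if $R(Y) = \sigma$ in the second line and part (a) in
the third line.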


Now we are ready to apply Theorem 8.4.4 to vectors of rank statistics. 


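Corollary 8.4.1. Let $(P_\theta)_{\theta \in \Theta}$ with $\Theta \subseteq \mathbb{R}$ denote an $L_1(\mu)$-differentiable
family with score function $L$ in $\theta_0 = 0$. Let $Y = (Y_1, \ldots, Y_n)^\top$ be a sample
from $P_{n,\theta} = \bigotimes_{i=1}^n P_{c_i \theta}$. Then, $P_{n,\theta}^R$ has score function

\[
\begin{aligned}
\sigma = (r_1, \ldots, r_n) \mapsto{}
& E_{n,0}\Bigl[ \sum_{i=1}^n c_i L(Y_i) \Bigm| R(Y) = \sigma \Bigr] \\
&= \sum_{i=1}^n c_i E_{n,0}\bigl[ L(Y_i) \mid R(Y) = \sigma \bigr] \\
&= \sum_{i=1}^n c_i E_{n,0}\bigl[ L(Y_{r_i:n}) \bigr] = \sum_{i=1}^n c_i \, a(r_i)
\end{aligned}
\]

with $E_{n,0}$ denoting the expectation with respect to $P_{n,0}$ and scores $a(i) =
E_{n,0}[L(Y_{i:n})]$.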


Remark 8.4.2. (a) The test statistic $T(Y) = \sum_{i=1}^n c_i \, a(R_i(Y))$ is called a linear
rank statistic.

(b) The hypothesis $H_0$ from (8.10) leads under conditioning on $R(Y)$ to a simple
null hypothesis on $S_n$, namely, the discrete uniform distribution on $S_n$, see
Lemma 8.4.2(b). Therefore, the critical value $c(\alpha)$ for the rank test $\varphi =
\varphi(R(Y))$, given by

\[
\varphi(y) =
\begin{cases}
1, & \text{if } T(y) > c(\alpha), \\
\gamma, & \text{if } T(y) = c(\alpha), \\
0, & \text{if } T(y) < c(\alpha),
\end{cases}
\]

can be computed by traversing all possible permutations $\sigma \in S_n$ and thereby
determining the discrete distribution of $T(Y)$ under $H_0$. For large $n$, we can
approximate $c(\alpha)$ by only traversing $B < n!$ randomly chosen permutations
$\sigma \in S_n$.

(c) For the scores, it holds $\sum_{i=1}^n a(i) = 0$. If $L$ is isotone, then it holds $a(1) \le
a(2) \le \ldots \le a(n)$.

(d) Due to the relation $Y_{i:n} \stackrel{D}{=} F^{-1}(U_{i:n})$, the scores are often given in the form
$a(i) = E[L \circ F^{-1}(U_{i:n})]$ and the function $L \circ F^{-1}$ is called score-generating
function. For large $n$, one can approximately work with $b(i) \stackrel{\mathrm{def}}{=} L \circ F^{-1}\bigl( \tfrac{i}{n+1} \bigr)$
(since $E[U_{i:n}] = i/(n + 1)$, see Remark 8.4.1) or with $b(i) \stackrel{\mathrm{def}}{=} n \int_{(i-1)/n}^{i/n} L \circ
F^{-1}(u) \, du$ instead of $a(i)$.
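The computation of $c(\alpha)$ described in part (b) can be sketched as follows. This is an added illustration with hypothetical function names; the two-sample coefficients and the identity scores (which give the rank sum and differ from centered scores only by an affine transformation) are assumptions made for the example. The critical value is taken as the smallest attained value $c$ with $P(T > c) \le \alpha$ under the uniform distribution on $S_n$.

```python
import itertools
import random


def linear_rank_statistic(c, a, r):
    """T = sum_i c_i * a(r_i); the ranks r_i are 1-based."""
    return sum(ci * a[ri - 1] for ci, ri in zip(c, r))


def upper_critical_value(values, alpha):
    """Smallest attained value v with (fraction of values > v) <= alpha."""
    values = sorted(values)
    n_val = len(values)
    for k, v in enumerate(values, start=1):
        if n_val - k <= alpha * n_val + 1e-12:
            return v


def exact_critical_value(c, a, alpha):
    """Traverse all n! rank vectors (feasible for small n only)."""
    n = len(c)
    values = [
        linear_rank_statistic(c, a, sigma)
        for sigma in itertools.permutations(range(1, n + 1))
    ]
    return upper_critical_value(values, alpha)


def monte_carlo_critical_value(c, a, alpha, n_perm, rng):
    """Approximate c(alpha) from n_perm randomly chosen permutations."""
    ranks = list(range(1, len(c) + 1))
    values = []
    for _ in range(n_perm):
        rng.shuffle(ranks)
        values.append(linear_rank_statistic(c, a, ranks))
    return upper_critical_value(values, alpha)


c = [1, 1, 1, 0, 0, 0]  # two-sample problem with n_1 = 3 and n = 6
a = [1, 2, 3, 4, 5, 6]  # identity scores, so T is the rank sum of group 1
print(exact_critical_value(c, a, 0.05))
print(monte_carlo_critical_value(c, a, 0.15, 5000, random.Random(0)))
```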


In the case that the score function is isotone, rank tests can also be used to test 
for stochastically ordered distributions in two-sample problems. 


Lemma 8.4.3 (Two-Sample Problems with Stochastically Ordered Distributions).
Assume that $a(1) \le a(2) \le \ldots \le a(n)$, cf. Remark 8.4.2(c), and let
$\varphi$ denote a rank test at level $\alpha$ for $H_0$ from (8.10), i.e., $E_{H_0}[\varphi] = \alpha$. Assume that
$Y_1, \ldots, Y_{n_1}$ are i.i.d. with c.d.f. $F_1$ of $Y_1$ and $Y_{n_1+1}, \ldots, Y_n$ i.i.d. with c.d.f. $F_2$ of
$Y_{n_1+1}$.

(a) If $F_1 \ge F_2$, then $E[\varphi] \le \alpha$.
(b) If $F_1 \le F_2$, then $E[\varphi] \ge \alpha$.


Proof. Lemma 4.4 in Janssen (1998).


In location parameter models as considered in Example 8.4.1(a), the score
function is isotone if and only if the density $f$ is strongly unimodal (that is, log-concave).
The following example discusses some specific instances of such densities and derives
the corresponding rank tests.


Example 8.4.3 (Two-Sample Rank Tests in Location Parameter Models with
“Stochastically Larger” Alternatives).

(i) Fisher–Yates test:
Let $f$ denote the density of $N(0, 1)$. Then it holds $L(y) = y$ and we obtain

\[
T = \sum_{i=1}^{n_1} a(R_i) \quad \text{with } a(i) = E[Y_{i:n}].
\]

In this, $Y_{i:n}$ denotes the $i$-th order statistic of i.i.d. random variables
$Y_1, \ldots, Y_n$ with $Y_1 \sim N(0, 1)$.

(ii) Van der Waerden test:
Let $f$ be as in part (i). The score-generating function is given by $u \mapsto
\Phi^{-1}(u)$. Following Remark 8.4.2(d), $b(i) = \Phi^{-1}(i/(n + 1))$ are approximate
scores, leading to the test statistic

\[
T = \sum_{i=1}^{n_1} \Phi^{-1}\Bigl( \frac{R_i}{n + 1} \Bigr).
\]

(iii) Wilcoxon's rank sum test:
Let $f$ be the density of the standard logistic distribution, given by $f(y) =
\exp(-y) (1 + \exp(-y))^{-2}$ with corresponding c.d.f. $F(y) = (1 + \exp(-y))^{-1}$.
The score-generating function is in this case given by $u \mapsto 2u - 1$, leading to
the scores

\[
a(i) = E[L \circ F^{-1}(U_{i:n})] = \frac{2i}{n + 1} - 1.
\]

These scores are an affine transformation of the identity and therefore, the test
can equivalently be carried out by means of the test statistic

\[
T = \sum_{i=1}^{n_1} R_i(Y),
\]

which is the sum of the ranks in the first group.

(iv) Median test:
The Lebesgue density of the Laplace distribution is given by $f(y) =
\exp(-|y|)/2$, with induced score-generating function $u \mapsto \operatorname{sgn}(\ln(2u)) =
\operatorname{sgn}(2u - 1)$. Approximate scores are therefore given by

\[
b(i) = L \circ F^{-1}\Bigl( \frac{i}{n + 1} \Bigr) =
\begin{cases}
1, & \text{if } i > (n + 1)/2, \\
0, & \text{if } i = (n + 1)/2, \\
-1, & \text{if } i < (n + 1)/2.
\end{cases}
\]
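The approximate scores of parts (ii) to (iv) can be tabulated with a few lines of Python. The sketch is an added illustration; the function names are hypothetical, and `statistics.NormalDist().inv_cdf` from the Python standard library is used for $\Phi^{-1}$.

```python
from statistics import NormalDist


def van_der_waerden_scores(n):
    """b(i) = Phi^{-1}(i / (n + 1)) for i = 1, ..., n."""
    return [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]


def wilcoxon_scores(n):
    """a(i) = 2 i / (n + 1) - 1 for i = 1, ..., n."""
    return [2 * i / (n + 1) - 1 for i in range(1, n + 1)]


def median_scores(n):
    """b(i) = sgn(2 i - (n + 1)) for i = 1, ..., n."""
    return [(2 * i > n + 1) - (2 * i < n + 1) for i in range(1, n + 1)]


def two_sample_statistic(scores, ranks_first_group):
    """T = sum of the scores attached to the ranks of the first group."""
    return sum(scores[r - 1] for r in ranks_first_group)


print(van_der_waerden_scores(7))
print(wilcoxon_scores(7))
print(median_scores(5))  # [-1, -1, 0, 1, 1]
# With Wilcoxon scores, T is an affine function of the rank sum 4 + 5 + 6 = 15:
print(two_sample_statistic(wilcoxon_scores(6), [4, 5, 6]), 2 * 15 / 7 - 3)
```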


We conclude this section with the Savage test (or log-rank test), an example for a 
scale parameter test, cf. Example 8.4.1(b). 


Example 8.4.4. Under the scale parameter model considered in Example 8.4.1(b),
assume that $X$ is exponentially distributed with density $f(x) = \exp(-x) \mathbf{1}_{[0,\infty)}(x)$.
Then we obtain for $y > 0$ the score function

\[
L(y) = -\Bigl( 1 + y \, \frac{f'(y)}{f(y)} \Bigr) = y - 1.
\]

Exercise 8.4.2. Show that for i.i.d. random variables $Y_1, \ldots, Y_n$ with $Y_1 \sim
\mathrm{Exp}(1)$, it holds

\[
E[Y_{i:n}] = \sum_{j=1}^{i} \frac{1}{n + 1 - j}.
\]

Making use of the latter result, exact scores are given by

\[
a(i) = \sum_{j=1}^{i} \frac{1}{n + 1 - j} - 1.
\]

Since $X$ is almost surely positive, the model $Y = \exp(\theta) X$ can be transformed
into the location parameter model $\log(Y) = \theta + \log(X)$. For $X \sim \mathrm{Exp}(1)$, it
holds that $\log(X)$ possesses a reflected Gumbel distribution, satisfying

\[
P(\log(X) \le x) = 1 - \exp(-\exp(x)), \qquad x \in \mathbb{R}.
\]
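The exact scores of the Savage test can be computed in exact rational arithmetic. The sketch below is an added illustration with a hypothetical function name; it uses the formula for $E[Y_{i:n}]$ from Exercise 8.4.2 and confirms that the scores sum to zero and are increasing, in agreement with Remark 8.4.2(c).

```python
from fractions import Fraction


def savage_scores(n):
    """a(i) = sum_{j=1}^{i} 1 / (n + 1 - j) - 1 for i = 1, ..., n."""
    scores = []
    expected = Fraction(0)
    for i in range(1, n + 1):
        expected += Fraction(1, n + 1 - i)  # E[Y_{i:n}] for Exp(1) data
        scores.append(expected - 1)
    return scores


print(savage_scores(3))        # [-2/3, -1/6, 5/6] as Fractions
print(sum(savage_scores(10)))  # 0
```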


8.4.3 Permutation Tests 


Permutation tests can be regarded as special instances of rank tests for k -sample 
problems. 


Example 8.4.5 (Two-Sample Problem in Gaussian Location Parameter Model).
Let $(Y_i)_{1 \le i \le n}$ denote real-valued, stochastically independent random variables,
where $Y_1, \ldots, Y_{n_1}$ are i.i.d. with $Y_1 \sim F_1$ and $Y_{n_1+1}, \ldots, Y_n$ are i.i.d. with
$Y_{n_1+1} \sim F_2$. Assume that the test problem of interest is given by

\[
H_0 : \{F_1 = F_2\} \quad \text{versus} \quad H_1 : \{F_1 \ne F_2\}. \tag{8.11}
\]

In the special case that $F_1$ and $F_2$ are Gaussian c.d.f.s which only differ in their
means, one would compare the empirical group means to carry out a test for problem
(8.11). More specifically, we let $n_2 \stackrel{\mathrm{def}}{=} n - n_1$ and define group means by $\bar Y_{n_1} \stackrel{\mathrm{def}}{=}
n_1^{-1} \sum_{i=1}^{n_1} Y_i$ and $\bar Y_{n_2} \stackrel{\mathrm{def}}{=} n_2^{-1} \sum_{i=n_1+1}^{n} Y_i$, assuming that $0 < n_1 < n$. The test
statistic of the resulting two-sample $Z$-test is then given by $\tilde T \stackrel{\mathrm{def}}{=} \bar Y_{n_1} - \bar Y_{n_2}$
and the test for (8.11) can easily be calibrated by noticing that $\bar Y_{n_1} - \bar Y_{n_2}$ is again
normally distributed under $H_0$.


However, in the case of general $F_1$ and $F_2$, exact distributional results for $\tilde T$
are difficult to obtain. Assuming that $F_1$ and $F_2$ are continuous, we consider more
general statistics of the form

\[
T = \sum_{i=1}^n c_i \, g(Y_i) = \sum_{i=1}^n c_{D_i(Y)} \, g(Y_{i:n}) \tag{8.12}
\]

for a given function $g : \mathbb{R} \to \mathbb{R}$ and real numbers $(c_i)_{1 \le i \le n}$.

The representation of $T$ on the right-hand side of (8.12) establishes the connection
to rank tests. For example, $\tilde T$ equals $T$ if we choose $g = \mathrm{id}$, $c_i = n_1^{-1}$
for $i \le n_1$ and $c_i = -n_2^{-1}$ for $i > n_1$. Under $H_0$ from (8.11), the antiranks
$D(Y) = (D_i(Y))_{1 \le i \le n}$ and the order statistics $(Y_{i:n})_{1 \le i \le n}$ are stochastically
independent, see Theorem 8.4.5. Due to this property, the two-sample permutation
test based on $T$ can be carried out according to the following resampling scheme.


Example 8.4.6 (Resampling Scheme for a Two-Sample Permutation Test). The 
following resampling scheme is appropriate for a one-sided “stochastically larger” 
alternative. The two-sided case is obtained by obvious modifications. 


(A) Consider the order statistics $(Y_{i:n})_{1 \le i \le n}$ and regard $a(i) = g(Y_{i:n})$ as random
scores.

(B) Denote by $D = (D_i)_{1 \le i \le n}$ a random vector which is uniformly distributed on
$S_n$ and let $c = c(\alpha, (Y_{i:n})_{1 \le i \le n})$ denote the $(1 - \alpha)$-quantile of the discretely
distributed random variable $D \mapsto \sum_{i=1}^n c_{D_i} \, a(i)$.

(C) The permutation test $\varphi$ for testing (8.11) is then given by

\[
\varphi =
\begin{cases}
1, & T > c, \\
\gamma, & T = c, \\
0, & T < c,
\end{cases}
\]

where $\gamma \in [0, 1]$ denotes a randomization constant.
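For the choice $g = \mathrm{id}$ with the two-sample coefficients of Example 8.4.5, the scheme can be sketched as follows. This is an added illustration and not the authors' implementation: the function names and data are hypothetical, the permutation distribution is summarized by a one-sided p-value instead of the pair $(c, \gamma)$, and random permutations are used only when `n_perm` is given.

```python
import itertools
import random


def mean_difference(first, second):
    return sum(first) / len(first) - sum(second) / len(second)


def two_sample_permutation_test(first, second, n_perm=None, seed=0):
    """One-sided permutation p-value for T = mean(first) - mean(second).

    With n_perm=None all splits of the pooled sample are enumerated, which
    is equivalent to traversing all permutations because T depends only on
    the group membership. Otherwise n_perm random permutations are drawn.
    """
    pooled = list(first) + list(second)
    n1 = len(first)
    t_obs = mean_difference(first, second)
    values = []
    if n_perm is None:
        indices = range(len(pooled))
        for chosen in itertools.combinations(indices, n1):
            chosen_set = set(chosen)
            g1 = [pooled[i] for i in chosen]
            g2 = [pooled[i] for i in indices if i not in chosen_set]
            values.append(mean_difference(g1, g2))
    else:
        rng = random.Random(seed)
        for _ in range(n_perm):
            rng.shuffle(pooled)
            values.append(mean_difference(pooled[:n1], pooled[n1:]))
    # Small tolerance guards against rounding when comparing with t_obs.
    p_value = sum(1 for v in values if v >= t_obs - 1e-12) / len(values)
    return t_obs, p_value


first = [5.1, 6.2, 7.3]
second = [1.0, 2.1, 3.2]
t_obs, p_exact = two_sample_permutation_test(first, second)
_, p_mc = two_sample_permutation_test(first, second, n_perm=4000)
print(t_obs, p_exact, p_mc)
```

Here the observed split is the most extreme of the 20 possible ones, so the exact one-sided p-value equals $1/20$.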


Remark 8.4.3. If we choose $g = \mathrm{id}$ and $(c_i)_{1 \le i \le n}$ as in Example 8.4.5, leading
to $\tilde T = T$, then the test $\varphi$ from Example 8.4.6 is called Pitman's permutation
test, see Pitman (1937).


The permutation test principle can be adapted to test the more general null 
hypothesis 


\[
H_0 : Y_1, \ldots, Y_n \text{ are i.i.d.} \tag{8.13}
\]

In the generalized form, the $Y_j$, $1 \le j \le n$, are not even restricted to be real-valued.
The modified resampling scheme is given as follows.


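Example 8.4.7 (Modified Resampling Scheme for General Permutation Tests).

(A) Consider $n$ random variates $Y_j$, $1 \le j \le n$, with values in some space $\mathcal{Y}$
and a real-valued test statistic $T = T(Y_1, \ldots, Y_n)$.

(B) In the remainder, consider random permutations $\pi$ with values in $S_n$ which are
independent of $Y_1, \ldots, Y_n$.

(C) Denote by $Q_0$ the uniform distribution on $S_n$ and let $c = c(Y_1, \ldots, Y_n)$
denote the $(1 - \alpha)$-quantile of $t \mapsto Q_0\bigl( \{ \pi \in S_n : T(Y_{\pi(1)}, \ldots, Y_{\pi(n)}) \le t \} \bigr)$.

(D) The modified permutation test $\tilde\varphi$ for testing (8.13) is then given by

\[
\tilde\varphi =
\begin{cases}
1, & T > c, \\
\gamma, & T = c, \\
0, & T < c,
\end{cases}
\]

where $\gamma \in [0, 1]$ again denotes a randomization constant.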




Theorem 8.4.6. Under the respective assumptions, the permutation test $\varphi$ defined
in Example 8.4.6 and the modified permutation test $\tilde\varphi$ defined in Example 8.4.7 are
under the null hypothesis $H_0$ from (8.11) or (8.13), respectively, tests of exact level
$\alpha$ for any fixed $n \in \mathbb{N}$.


Proof. Conditional to the order statistics (Example 8.4.6) or to the data themselves
(Example 8.4.7), the critical value $c$ and the randomization constant $\gamma$ are chosen
such that

\[
E_{H_0}[\varphi \mid Y = y] = E_{Q_0}[\tilde\varphi \mid Y = y] = \alpha
\]

holds true. Furthermore, the antiranks $D(Y)$ are under $H_0$ from (8.11) stochastically
independent of the order statistics. Analogously, the random permutations
are chosen stochastically independent of $(Y_1, \ldots, Y_n)$ in the case of $\tilde\varphi$. The result
of the theorem follows by averaging with respect to the distribution of $Y$.
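The conditional level statement in the proof can be verified exactly on a small data set. The sketch below is an added illustration with hypothetical names and data; it traverses all permutations of the observations, determines $c$ as the smallest attained value with $Q_0(T > c) \le \alpha$, and chooses $\gamma$ such that $Q_0(T > c) + \gamma \, Q_0(T = c) = \alpha$. Exact rational arithmetic avoids rounding issues with ties.

```python
import itertools
from fractions import Fraction


def randomized_permutation_test(y, statistic, alpha):
    """Critical value c and randomization constant gamma given the data y."""
    values = sorted(statistic(p) for p in itertools.permutations(y))
    total = len(values)
    # c is the smallest attained value with Q_0(T > c) <= alpha.
    for c in values:
        p_greater = Fraction(sum(1 for v in values if v > c), total)
        if p_greater <= alpha:
            break
    p_equal = Fraction(sum(1 for v in values if v == c), total)
    gamma = (alpha - p_greater) / p_equal
    return c, gamma, p_greater, p_equal


def mean_difference(p):
    # The first two entries form group 1, the remaining three form group 2.
    return Fraction(sum(p[:2]), 2) - Fraction(sum(p[2:]), 3)


y = (3, 1, 4, 15, 9)
alpha = Fraction(3, 20)
c, gamma, p_greater, p_equal = randomized_permutation_test(y, mean_difference, alpha)
conditional_level = p_greater + gamma * p_equal
print(c, gamma, conditional_level)  # 31/6 1/2 3/20
```

Since the conditional level equals $\alpha$ for every realization of the data, averaging over the data gives level exactly $\alpha$, as stated in Theorem 8.4.6.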


8.5 Historical Remarks and Further Reading 


The Kolmogorov–Smirnov test goes back to Kolmogorov (1933) and Smirnov
(1948). The origins of the $\omega^2$ test can be traced back to Cramér (1928) and
the German lecture notes by von Mises (1931). The limiting $\omega^2$ distribution
has been derived in the work by Smirnov (1937). A comprehensive resource
for (nonparametric) goodness-of-fit tests is the book edited by D'Agostino and
Stephens (1986).

The concept of Bayes factors goes back to Jeffreys (1935) and is treated 
comprehensively by Kass and Raftery (1995). Bayesian approaches to hypothesis 
testing are discussed in Sect. 4.3.3 of Berger (1985); see also the references therein. 

Our treatment of score tests mainly follows Janssen (1998). The theory of rank
tests is developed in the textbook by Hájek and Šidák (1967). The classical reference
for permutation tests is Pitman (1937). Recent textbook and monograph references 
on the subject are Good (2005), Edgington and Onghena (2007), and Pesarin and 
Salmaso (2010). 


Appendix A 
Deviation Probability for Quadratic Forms 


A.1 Introduction 


This chapter presents a number of deviation probability bounds for a quadratic form
$\|\xi\|^2$ or more generally $\|B\xi\|^2$ of a random $p$-vector $\xi$ satisfying a general
exponential moment condition. Such quadratic forms arise in many applications.
Baraud (2010) lists some statistical tasks relying on such deviation bounds including
hypothesis testing for linear models or linear model selection. We also refer to
Massart (2007) for an extensive overview and numerous results on probability
bounds and their applications in statistical model selection. Limit theorems for
quadratic forms can be found e.g. in Götze and Tikhomirov (1999) and Horváth
and Shao (1999). Some concentration bounds for U-statistics are available in
Bretagnolle (1999), Giné et al. (2000), Houdré and Reynaud-Bouret (2003). Most of
these results assume that the components of the vector $\xi$ are independent and bounded.

Hsu et al. (2012) study the tail behavior of the quadratic form under the condition
of sub-Gaussianity of the random vector $\xi$ and show that the deviation probabilities
are essentially the same as in the Gaussian case. However, the assumption that the
vector $\xi$ has finite exponential moments of arbitrary order is quite strict and is
not fulfilled in many applications. Particular examples are given by the Poisson
and exponential cases. In the present work we only suppose that some exponential
moments of $\xi$ are finite. This makes the problem much more involved and requires
new approaches and tools.

If $\xi$ is standard normal then $\|\xi\|^2$ is chi-squared with $p$ degrees of freedom.
We aim to extend this behavior to the case of a general vector $\xi$ satisfying the
following exponential moment condition:

$$\log E \exp(\gamma^\top \xi) \le \|\gamma\|^2/2, \qquad \gamma \in \mathbb{R}^p, \ \|\gamma\| \le g. \tag{A.1}$$


Here $g$ is a positive constant which turns out to be very important in our results.
Namely, it determines the frontier between the Gaussian and non-Gaussian type
deviation bounds. Our first result shows that under (A.1) the deviation bounds for
the quadratic form $\|\xi\|^2$ are essentially the same as in the Gaussian case, if the
value $g^2$ exceeds $Cp$ for a fixed constant $C$. Further we extend the result to
the case of a more general form $\|B\xi\|^2$. An important advantage of the presented
approach, which makes it different from all the previous studies, is that there are no
additional conditions on the structure or origin of the vector $\xi$. For instance,
we do not assume that $\xi$ is a sum of independent or weakly dependent random
variables, or that the components of $\xi$ are independent. The results are stated in a
non-asymptotic fashion, all the constants are explicit and the leading terms are sharp.

As a motivating example, we consider a linear regression model $Y = \Psi^\top \theta^* + \varepsilon$
in which $Y$ is an $n$-vector of observations, $\varepsilon$ is the vector of errors with zero mean,
and $\Psi$ is a $p \times n$ design matrix. The ordinary least squares estimator $\tilde\theta$ for the
parameter vector $\theta^* \in \mathbb{R}^p$ reads as

$$\tilde\theta = (\Psi\Psi^\top)^{-1} \Psi Y$$

and it can be viewed as the maximum likelihood estimator in a Gaussian linear
model with a diagonal covariance matrix, that is, $Y \sim N(\Psi^\top\theta^*, \sigma^2 I_n)$. Define the
$p \times p$ matrix

$$D_0^2 = \Psi\Psi^\top.$$

Then

$$D_0(\tilde\theta - \theta^*) = D_0^{-1}\zeta$$

with $\zeta \stackrel{\mathrm{def}}{=} \Psi\varepsilon$. The likelihood ratio test statistic for this problem is exactly
$\|D_0^{-1}\zeta\|^2/2$. Similarly, the model selection procedure is based on comparing such
quadratic forms for different matrices $D_0$; see e.g. Baraud (2010).

Now we indicate how this situation can be reduced to a bound for a vector $\xi$
satisfying the condition (A.1). Suppose for simplicity that the entries $\varepsilon_i$ of the error
vector $\varepsilon$ are independent and have exponential moments.

$(e_1)$ There exist some constants $\nu_0$ and $g_1 > 0$, and for every $i$ a constant $s_i$
such that $E(\varepsilon_i/s_i)^2 \le 1$ and

$$\log E \exp(\lambda \varepsilon_i/s_i) \le \nu_0^2 \lambda^2/2, \qquad |\lambda| \le g_1. \tag{A.2}$$

Here $g_1$ is a fixed positive constant. One can show that if this condition is
fulfilled for some $g_1 > 0$ and a constant $\nu_0 \ge 1$, then one can get a similar
condition with $\nu_0$ arbitrarily close to one and $g_1$ slightly decreased. A natural
candidate for $s_i$ is $\sigma_i$, where $\sigma_i^2 = E\varepsilon_i^2$ is the variance of $\varepsilon_i$. Under (A.2),
introduce a $p \times p$ matrix $V_0$ defined by

$$V_0^2 \stackrel{\mathrm{def}}{=} \sum_i s_i^2 \Psi_i \Psi_i^\top,$$

where $\Psi_1, \ldots, \Psi_n \in \mathbb{R}^p$ are the columns of the matrix $\Psi$. Define also

$$\xi = V_0^{-1}\Psi\varepsilon, \qquad N^{-1/2} \stackrel{\mathrm{def}}{=} \max_i \sup_{\gamma \in \mathbb{R}^p} \frac{s_i |\Psi_i^\top \gamma|}{\|V_0\gamma\|}.$$

A simple calculation shows that for $\|\gamma\| \le g = g_1 N^{1/2}$

$$\log E \exp(\gamma^\top \xi) \le \nu_0^2 \|\gamma\|^2/2, \qquad \gamma \in \mathbb{R}^p, \ \|\gamma\| \le g.$$

We conclude that (A.1) is nearly fulfilled under $(e_1)$ and moreover, the value $g^2$ is
proportional to the effective sample size $N$. The results below allow us to get a nearly
$\chi^2$-behavior of the test statistic $\|\xi\|^2$, which is a finite sample version of the famous
Wilks phenomenon; see e.g. Fan et al. (2001), Fan and Huang (2005), Boucheron
and Massart (2011).

Section A.2 recalls the classical results about the deviation probability of a
Gaussian quadratic form. These results are presented only for comparison and to
make the presentation self-contained.

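Section A.3 studies the probability of the form $P(\|\xi\| > y)$ under the condition

$$\log E \exp(\gamma^\top \xi) \le \nu_0^2 \|\gamma\|^2/2, \qquad \gamma \in \mathbb{R}^p, \ \|\gamma\| \le g.$$

The general case can be reduced to $\nu_0 = 1$ by rescaling $\xi$ and $g$:

$$\log E \exp(\gamma^\top \xi/\nu_0) \le \|\gamma\|^2/2, \qquad \gamma \in \mathbb{R}^p, \ \|\gamma\| \le \nu_0 g,$$

that is, $\nu_0^{-1}\xi$ fulfills (A.1) with a slightly increased $g$.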

The obtained result is extended to the case of a general quadratic form in 
Sect. A.4. Some more extensions motivated by different statistical problems are 
given in Sects. A.6 and A.7. They include the bound with sup-norm constraint and 
the bound under Bernstein conditions. Among the statistical problems demanding 
such bounds is estimation of the regression model with Poissonian or bounded 
random noise. More examples can be found in Baraud (2010). All the proofs are 
collected in Sect. A.8. 


A.2 Gaussian Case 


Our benchmark will be a deviation bound for $\|\xi\|^2$ for a standard Gaussian vector
$\xi$. The ultimate goal is to show that under (A.1) the norm of the vector $\xi$
exhibits the behavior expected for a Gaussian vector, at least in the region of moderate
deviations. For comparison, we begin by stating the result for a
Gaussian vector $\xi$. We use the notation $a \vee b$ for the maximum of $a$ and $b$,
while $a \wedge b = \min\{a, b\}$.

Theorem A.2.1. Let $\xi$ be a standard normal vector in $\mathbb{R}^p$. Then for any $u > 0$,
it holds

$$P(\|\xi\|^2 > p + u) \le \exp\{-(p/2)\phi(u/p)\}$$

with

$$\phi(t) = t - \log(1 + t).$$

Let $\phi^{-1}(\cdot)$ stand for the inverse of $\phi(\cdot)$. For any $x$,

$$P\bigl(\|\xi\|^2 > p + p\,\phi^{-1}(2x/p)\bigr) \le \exp(-x).$$

This particularly yields with $\varkappa = 6.6$

$$P\bigl(\|\xi\|^2 > p + \sqrt{\varkappa x p} \vee (\varkappa x)\bigr) \le \exp(-x).$$


This is a simple version of a well-known result and we present it only for
comparison with the non-Gaussian case. The message of this result is that the
squared norm of the Gaussian vector $\xi$ concentrates around the value $p$ and its
deviation over the level $p + \sqrt{xp}$ is exponentially small in $x$.
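
The bounds of Theorem A.2.1 can be compared with the exact chi-squared tail. The following is a minimal sketch, not part of the original text: it assumes an even dimension $p$, for which the tail has a closed form as a finite Poisson sum, and the function names are chosen here for illustration only.

```python
import math

KAPPA = 6.6  # the constant called kappa (varkappa) in Theorem A.2.1


def phi(t):
    """phi(t) = t - log(1 + t)."""
    return t - math.log1p(t)


def chi2_tail_even(p, t):
    """Exact P(chi^2_p > t) for even p: exp(-t/2) * sum_{k < p/2} (t/2)^k / k!."""
    assert p > 0 and p % 2 == 0
    half = t / 2.0
    term, total = 1.0, 1.0
    for k in range(1, p // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def chernoff_bound(p, u):
    """Right-hand side exp{-(p/2) phi(u/p)} of the first bound."""
    return math.exp(-(p / 2.0) * phi(u / p))


def kappa_level(p, x):
    """Level p + sqrt(kappa*x*p) v (kappa*x) of the last bound."""
    return p + max(math.sqrt(KAPPA * x * p), KAPPA * x)


if __name__ == "__main__":
    for p in (2, 10, 50):
        for x in (0.5, 1.0, 3.0, 10.0):
            level = kappa_level(p, x)
            print(p, x, chi2_tail_even(p, level), math.exp(-x))
```

In each printed row the exact tail (third column) should not exceed $e^{-x}$ (fourth column); the gap indicates how conservative the explicit constant $\varkappa = 6.6$ is.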

A similar bound can be obtained for the norm of the vector $B\xi$ where $B$ is
some given deterministic matrix. For notational simplicity we assume that $B$ is
symmetric. Otherwise one should replace it with $(B^\top B)^{1/2}$.


Theorem A.2.2. Let $\xi$ be standard normal in $\mathbb{R}^p$. Then for every $x > 0$ and any
symmetric matrix $B$, it holds with $p = \operatorname{tr}(B^2)$, $v^2 = 2\operatorname{tr}(B^4)$, and $a^* = \|B^2\|_\infty$

$$P\bigl(\|B\xi\|^2 > p + (2 v x^{1/2}) \vee (6 a^* x)\bigr) \le \exp(-x).$$


Below we establish similar bounds for a non-Gaussian vector $\xi$ obeying (A.1).
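
As an illustration of Theorem A.2.2, the exponential Chebyshev bound can be minimized numerically for a diagonal $B^2 = \operatorname{diag}(a_1, \ldots, a_p)$ and compared with $e^{-x}$. This is only a sketch under that diagonal assumption; the grid search over $\mu$ and the helper names are choices made here, not taken from the text.

```python
import math


def chernoff_quadratic(a, u, grid=2000):
    """Minimize exp{-mu*u/2 - 0.5*sum_i [mu*a_i + log(1 - mu*a_i)]}
    over a grid of mu in (0, 1/max(a)); a holds the eigenvalues of B^2."""
    amax = max(a)
    best = 1.0
    for j in range(1, grid):
        mu = j / grid / amax
        expo = -0.5 * mu * u - 0.5 * sum(mu * ai + math.log1p(-mu * ai) for ai in a)
        best = min(best, math.exp(expo))
    return best


def level_u(a, x):
    """Deviation u = (2 v sqrt(x)) v (6 a* x) with v^2 = 2 tr(B^4), a* = max a_i."""
    v = math.sqrt(2.0 * sum(ai * ai for ai in a))
    return max(2.0 * v * math.sqrt(x), 6.0 * max(a) * x)


if __name__ == "__main__":
    a = [1.0, 0.5, 0.25, 0.1]  # hypothetical eigenvalues of B^2
    for x in (0.1, 1.0, 5.0, 20.0):
        print(x, chernoff_quadratic(a, level_u(a, x)), math.exp(-x))
```

The optimized Chebyshev bound is an upper bound for $P(\|B\xi\|^2 > p + u)$ in the Gaussian case, so seeing it below $e^{-x}$ is consistent with the theorem.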


A.3 A Bound for the $\ell_2$-Norm


This section presents a general exponential bound for the probability $P(\|\xi\| > y)$
under (A.1). The main result tells us that if $y$ is not too large, namely if $y \le y_c$
with $y_c^2 \asymp g^2$, then the deviation probability is essentially the same as in the
Gaussian case.

To describe the value $y_c$, introduce the following notation. Given $g$ and $p$,
define the values $w_0 = g p^{-1/2}$ and $w_c$ by the equation

$$\frac{w_c(1 + w_c)}{(1 + w_c^2)^{1/2}} = w_0 = g p^{-1/2}. \tag{A.3}$$

It is easy to see that $w_0/\sqrt{2} \le w_c \le w_0$. Further define

$$\mu_c = w_c^2/(1 + w_c^2),$$

$$y_c^2 = (1 + w_c^2)\,p,$$

$$x_c = 0.5\,p\,[w_c^2 - \log(1 + w_c^2)]. \tag{A.4}$$

Note that for $g^2 \ge p$, the quantities $y_c$ and $x_c$ can be evaluated as $y_c^2 \ge w_c^2 p \ge g^2/2$
and $x_c \gtrsim p w_c^2/2 \ge g^2/4$.
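
The quantities in (A.3) and (A.4) are easy to compute numerically. The sketch below solves (A.3) by bisection on the bracket $[w_0/\sqrt{2}, w_0]$ and then evaluates $\mu_c$, $y_c$, $x_c$, together with $g_c = g - \sqrt{\mu_c p}$ used in the theorem that follows. The function name and the bisection approach are illustrative choices, not from the text.

```python
import math


def critical_values(g, p):
    """Return (w_c, mu_c, y_c, x_c, g_c) for given g > 0 and dimension p."""
    w0 = g / math.sqrt(p)

    def f(w):
        # Left-hand side of (A.3) minus w0; increasing in w.
        return w * (1.0 + w) / math.sqrt(1.0 + w * w) - w0

    lo, hi = w0 / math.sqrt(2.0), w0  # f(lo) <= 0 <= f(hi)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    wc = 0.5 * (lo + hi)
    mu_c = wc * wc / (1.0 + wc * wc)
    y_c = math.sqrt((1.0 + wc * wc) * p)
    x_c = 0.5 * p * (wc * wc - math.log1p(wc * wc))
    g_c = g - math.sqrt(mu_c * p)
    return wc, mu_c, y_c, x_c, g_c


if __name__ == "__main__":
    print(critical_values(g=10.0, p=4))
```

For example, `critical_values(10.0, 4)` can be used to check numerically that $y_c = g/\mu_c - \sqrt{p/\mu_c}$ and $g_c = g w_c/(1 + w_c)$.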


Theorem A.3.1. Let $\xi \in \mathbb{R}^p$ fulfill (A.1). Then it holds for each $x \le x_c$

$$P\bigl(\|\xi\|^2 > p + \sqrt{\varkappa x p} \vee (\varkappa x), \ \|\xi\| \le y_c\bigr) \le 2\exp(-x),$$

where $\varkappa = 6.6$. Moreover, for $y \ge y_c$, it holds with $g_c = g - \sqrt{\mu_c p} = g w_c/(1 + w_c)$

$$P(\|\xi\| > y) \le 8.4\exp\{-g_c y/2 - (p/2)\log(1 - g_c/y)\}
\le 8.4\exp\{-x_c - g_c(y - y_c)/2\}.$$


The statements of Theorem A.3.1 can be simplified under the assumption $g^2 \ge p$.


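Corollary A.3.1. Let $\xi$ fulfill (A.1) and $g^2 \ge p$. Then it holds for $x \le x_c$

$$P\bigl(\|\xi\|^2 \ge \mathfrak{z}(x, p)\bigr) \le 2e^{-x} + 8.4e^{-x_c}, \tag{A.5}$$

$$\mathfrak{z}(x, p) \stackrel{\mathrm{def}}{=} \begin{cases} p + \sqrt{\varkappa x p}, & x \le p/\varkappa, \\ p + \varkappa x, & p/\varkappa < x \le x_c, \end{cases} \tag{A.6}$$

with $\varkappa = 6.6$. For $x > x_c$

$$P\bigl(\|\xi\|^2 \ge \mathfrak{z}_c(x, p)\bigr) \le 8.4e^{-x}, \qquad \mathfrak{z}_c(x, p) \stackrel{\mathrm{def}}{=} \bigl|y_c + 2(x - x_c)/g_c\bigr|^2.$$

This result implicitly assumes that $p \le \varkappa x_c$, which is fulfilled if $w_0^2 = g^2/p \ge 1$:

$$\varkappa x_c = 0.5\varkappa\,[w_0^2 - \log(1 + w_0^2)]\,p \ge 3.3\,[1 - \log(2)]\,p > p.$$


For $x \le x_c$, the function $\mathfrak{z}(x, p)$ mimics the quantile behavior of the chi-squared
distribution $\chi_p^2$ with $p$ degrees of freedom. Moreover, an increase of the value $g$
yields a growth of the sub-Gaussian zone. In particular, for $g = \infty$, a general
quadratic form $\|\xi\|^2$ has under (A.1) the same tail behavior as in the Gaussian
case.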
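
The two-regime structure of the corollary can be made concrete with a small helper. This is a sketch only: `z_moderate` implements (A.6) and `z_large` the large deviation level $\mathfrak{z}_c$, with $x_c$, $y_c$, $g_c$ passed in as precomputed numbers; the function names are hypothetical.

```python
import math

KAPPA = 6.6


def z_moderate(x, p):
    """z(x, p) from (A.6); meaningful for x <= x_c."""
    if x <= p / KAPPA:
        return p + math.sqrt(KAPPA * x * p)
    return p + KAPPA * x


def z_large(x, x_c, y_c, g_c):
    """z_c(x, p) = |y_c + 2 (x - x_c) / g_c|^2 for x > x_c."""
    return (y_c + 2.0 * (x - x_c) / g_c) ** 2


if __name__ == "__main__":
    p = 10
    for x in (0.5, p / KAPPA, 3.0, 8.0):
        print(x, z_moderate(x, p))
```

The two branches of (A.6) agree at $x = p/\varkappa$, where both give $2p$, so the level is continuous in $x$; at $x = x_c$ the large deviation level equals $y_c^2$.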
Finally, in the large deviation zone $x > x_c$ the deviation probability decays
as $e^{-c x^{1/2}}$ for some fixed $c$. However, if the constant $g$ in the condition (A.1)
is sufficiently large relative to $p$, then $x_c$ is large as well and the large deviation
zone $x > x_c$ can be ignored at a small price of $8.4e^{-x_c}$, and one can focus on the
deviation bound described by (A.5) and (A.6).


A.4 A Bound for a Quadratic Form 


Now we extend the result to a more general bound for $\|B\xi\|^2 = \xi^\top B^2 \xi$ with a given
matrix $B$ and a vector $\xi$ obeying the condition (A.1). Similarly to the Gaussian
case we assume that $B$ is symmetric. Define important characteristics of $B$

$$p \stackrel{\mathrm{def}}{=} \operatorname{tr}(B^2), \qquad v^2 \stackrel{\mathrm{def}}{=} 2\operatorname{tr}(B^4), \qquad \lambda_B \stackrel{\mathrm{def}}{=} \|B^2\|_\infty = \lambda_{\max}(B^2).$$

For simplicity of formulation we suppose that $\lambda_B = 1$; otherwise one has to replace
$p$ and $v^2$ with $p/\lambda_B$ and $v^2/\lambda_B^2$.
Let $g$ be the constant from (A.1). Define, similarly to the $\ell_2$-case, $w_c$ by the equation

$$\frac{w_c(1 + w_c)}{(1 + w_c^2)^{1/2}} = g p^{-1/2}.$$

Define also $\mu_c = w_c^2/(1 + w_c^2) \wedge 2/3$. Note that $w_c^2 \ge 2$ implies $\mu_c = 2/3$.
Further define

$$y_c^2 = (1 + w_c^2)\,p, \qquad 2x_c = \mu_c y_c^2 + \log\det\{I_p - \mu_c B^2\}. \tag{A.7}$$

Similarly to the case with $B = I_p$, under the condition $g^2 \ge p$, one can bound
$y_c^2 \ge g^2/2$ and $x_c \gtrsim g^2/4$.


Theorem A.4.1. Let a random vector $\xi$ in $\mathbb{R}^p$ fulfill (A.1). Then for each $x \le x_c$

$$P\bigl(\|B\xi\|^2 > p + (2 v x^{1/2}) \vee (6x), \ \|B\xi\| \le y_c\bigr) \le 2\exp(-x).$$

Moreover, for $y \ge y_c$, with $g_c = g - \sqrt{\mu_c p} = g w_c/(1 + w_c)$, it holds

$$P(\|B\xi\| > y) \le 8.4\exp\bigl(-x_c - g_c(y - y_c)/2\bigr).$$


Now we describe the value $\mathfrak{z}(x, B)$ ensuring a small value for the large deviation
probability $P(\|B\xi\|^2 > \mathfrak{z}(x, B))$. For ease of formulation, we suppose that $g^2 \ge 2p$,
yielding $\mu_c^{-1} \le 3/2$. The other case can be easily adjusted.


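Corollary A.4.1. Let $\xi$ fulfill (A.1) with $g^2 \ge 2p$. Then it holds for $x \le x_c$ with
$x_c$ from (A.7):

$$P\bigl(\|B\xi\|^2 \ge \mathfrak{z}(x, B)\bigr) \le 2e^{-x} + 8.4e^{-x_c},$$

$$\mathfrak{z}(x, B) \stackrel{\mathrm{def}}{=} \begin{cases} p + 2 v x^{1/2}, & x \le v/18, \\ p + 6x, & v/18 < x \le x_c. \end{cases} \tag{A.8}$$

For $x > x_c$

$$P\bigl(\|B\xi\|^2 \ge \mathfrak{z}_c(x, B)\bigr) \le 8.4e^{-x}, \qquad \mathfrak{z}_c(x, B) \stackrel{\mathrm{def}}{=} \bigl|y_c + 2(x - x_c)/g_c\bigr|^2.$$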


A.5 Rescaling and Regularity Condition


The result of Theorem A.4.1 can be extended to a more general situation when the
condition (A.1) is fulfilled for a vector $\zeta$ rescaled by a matrix $V_0$. More precisely,
let the random $p$-vector $\zeta$ fulfill for some $p \times p$ matrix $V_0$ the condition

$$\sup_{\gamma \in \mathbb{R}^p} \log E \exp\Bigl(\lambda \frac{\gamma^\top \zeta}{\|V_0\gamma\|}\Bigr) \le \nu_0^2\lambda^2/2, \qquad |\lambda| \le g, \tag{A.9}$$

with some constants $g > 0$, $\nu_0 \ge 1$. Again, a simple change of variables reduces
the case of an arbitrary $\nu_0 \ge 1$ to $\nu_0 = 1$. Our aim is to bound the squared norm
$\|D_0^{-1}\zeta\|^2$ of a vector $D_0^{-1}\zeta$ for another $p \times p$ positive symmetric matrix $D_0^2$. Note
that condition (A.9) implies (A.1) for the rescaled vector $\xi = V_0^{-1}\zeta$. This leads to
bounding the quadratic form $\|D_0^{-1}V_0\xi\|^2 = \|B\xi\|^2$ with $B^2 = D_0^{-1}V_0^2 D_0^{-1}$. It
obviously holds

$$p = \operatorname{tr}(B^2) = \operatorname{tr}(D_0^{-2}V_0^2).$$

Now we can apply the result of Corollary A.4.1.


Corollary A.5.1. Let $\zeta$ fulfill (A.9) with some $V_0$ and $g$. Given $D_0$, define $B^2 =
D_0^{-1}V_0^2 D_0^{-1}$, and let $g^2 \ge 2p$. Then it holds for $x \le x_c$ with $x_c$ from (A.7):

$$P\bigl(\|D_0^{-1}\zeta\|^2 \ge \mathfrak{z}(x, B)\bigr) \le 2e^{-x} + 8.4e^{-x_c},$$

with $\mathfrak{z}(x, B)$ from (A.8). For $x > x_c$

$$P\bigl(\|D_0^{-1}\zeta\|^2 \ge \mathfrak{z}_c(x, B)\bigr) \le 8.4e^{-x}, \qquad \mathfrak{z}_c(x, B) \stackrel{\mathrm{def}}{=} \bigl|y_c + 2(x - x_c)/g_c\bigr|^2.$$


In the regular case with $D_0 \ge a V_0$ for some $a > 0$, one obtains $\|B\|_\infty \le a^{-1}$
and

$$v^2 = 2\operatorname{tr}(B^4) \le 2a^{-2}p.$$


A.6 A Chi-Squared Bound with Norm-Constraints


This section extends the results to the case when the bound (A.1) requires some
other conditions than a bound on the $\ell_2$-norm of the vector $\gamma$. Namely, we suppose that

$$\log E \exp(\gamma^\top \xi) \le \|\gamma\|^2/2, \qquad \gamma \in \mathbb{R}^p, \ \|\gamma\|_\circ \le g_\circ, \tag{A.10}$$

where $\|\cdot\|_\circ$ is a norm which differs from the usual Euclidean norm. Our driving
example is given by the sup-norm case with $\|\gamma\|_\circ = \|\gamma\|_\infty$. We are interested in
checking whether the previous results of Sect. A.3 still apply. The answer depends on
how massive the set $A(r) = \{\gamma : \|\gamma\|_\circ \le r\}$ is in terms of the standard Gaussian
measure on $\mathbb{R}^p$. Recall that the quadratic norm $\|\varepsilon\|^2$ of a standard Gaussian
vector $\varepsilon$ in $\mathbb{R}^p$ concentrates around $p$, at least for $p$ large. We need a similar
concentration property for the norm $\|\cdot\|_\circ$. More precisely, we assume for a fixed
$r_*$ that

$$P(\|\varepsilon\|_\circ \le r_*) \ge 1/2, \qquad \varepsilon \sim N(0, I_p). \tag{A.11}$$

This implies for any value $u_\circ > 0$ and all $u \in \mathbb{R}^p$ with $\|u\|_\circ \le u_\circ$ that

$$P(\|\varepsilon - u\|_\circ \le r_* + u_\circ) \ge 1/2, \qquad \varepsilon \sim N(0, I_p).$$

For each $\mathfrak{z} > p$, consider

$$\mu(\mathfrak{z}) = (\mathfrak{z} - p)/\mathfrak{z}.$$

Given $u_\circ$, denote by $\mathfrak{z}_\circ = \mathfrak{z}_\circ(u_\circ)$ the root of the equation

$$\frac{g_\circ}{\mu(\mathfrak{z}_\circ)} - \frac{r_*}{\mu^{1/2}(\mathfrak{z}_\circ)} = u_\circ. \tag{A.12}$$

One can easily see that this value exists and is unique if $u_\circ \ge g_\circ - r_*$, and it can be
defined as the largest $\mathfrak{z}$ for which $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z}) \ge u_\circ$. Let $\mu_\circ = \mu(\mathfrak{z}_\circ)$ be the
corresponding $\mu$-value. Define also $x_\circ$ by

$$2x_\circ = \mu_\circ \mathfrak{z}_\circ + p\log(1 - \mu_\circ).$$

If $u_\circ < g_\circ - r_*$, then set $\mathfrak{z}_\circ = \infty$, $x_\circ = \infty$.


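Theorem A.6.1. Let a random vector $\xi$ in $\mathbb{R}^p$ fulfill (A.10). Suppose (A.11) and
let, given $u_\circ$, the value $\mathfrak{z}_\circ$ be defined by (A.12). Then it holds for any $u > 0$

$$P\bigl(\|\xi\|^2 > p + u, \ \|\xi\|_\circ \le u_\circ\bigr) \le 2\exp\{-(p/2)\phi(u/p)\}, \tag{A.13}$$

yielding for $x \le x_\circ$

$$P\bigl(\|\xi\|^2 > p + \sqrt{\varkappa x p} \vee (\varkappa x), \ \|\xi\|_\circ \le u_\circ\bigr) \le 2\exp(-x), \tag{A.14}$$

where $\varkappa = 6.6$. Moreover, for $\mathfrak{z} \ge \mathfrak{z}_\circ$, it holds

$$P\bigl(\|\xi\|^2 > \mathfrak{z}, \ \|\xi\|_\circ \le u_\circ\bigr) \le 2\exp\{-\mu_\circ \mathfrak{z}/2 - (p/2)\log(1 - \mu_\circ)\}
= 2\exp\{-x_\circ - \mu_\circ(\mathfrak{z} - \mathfrak{z}_\circ)/2\}.$$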


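It is easy to check that the result continues to hold for the norm of $\Pi\xi$ for a
given sub-projector $\Pi$ in $\mathbb{R}^p$ satisfying $\Pi = \Pi^\top$, $\Pi^2 \le \Pi$. As above, denote
$p \stackrel{\mathrm{def}}{=} \operatorname{tr}(\Pi^2)$, $v^2 \stackrel{\mathrm{def}}{=} 2\operatorname{tr}(\Pi^4)$. Let $r_*$ be fixed to ensure

$$P(\|\Pi\varepsilon\|_\circ \le r_*) \ge 1/2, \qquad \varepsilon \sim N(0, I_p).$$


The next result is stated for $g_\circ \ge r_* + u_\circ$, which simplifies the formulation.


Theorem A.6.2. Let a random vector $\xi$ in $\mathbb{R}^p$ fulfill (A.10) and let $\Pi$ satisfy $\Pi =
\Pi^\top$, $\Pi^2 \le \Pi$. Let some $u_\circ$ be fixed. Then for any $\mu_\circ \le 2/3$ with
$g_\circ\mu_\circ^{-1} - r_*\mu_\circ^{-1/2} \ge u_\circ$,

$$E \exp\Bigl\{\frac{\mu_\circ}{2}\bigl(\|\Pi\xi\|^2 - p\bigr)\Bigr\}\,\mathbb{1}\bigl(\|\Pi^2\xi\|_\circ \le u_\circ\bigr) \le 2\exp(\mu_\circ^2 v^2/4), \tag{A.15}$$

where $v^2 = 2\operatorname{tr}(\Pi^4)$. Moreover, if $g_\circ \ge r_* + u_\circ$, then for any $x > 0$ and any
$\mathfrak{z} \ge p + (2 v x^{1/2}) \vee (6x)$

$$P\bigl(\|\Pi\xi\|^2 > \mathfrak{z}, \ \|\Pi^2\xi\|_\circ \le u_\circ\bigr)
\le P\bigl(\|\Pi\xi\|^2 > p + (2 v x^{1/2}) \vee (6x), \ \|\Pi^2\xi\|_\circ \le u_\circ\bigr) \le 2\exp(-x).$$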


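A.7 A Bound for the $\ell_2$-Norm Under Bernstein Conditions


For comparison, we specialize the results to the case considered recently in Baraud
(2010). Let $\zeta$ be a random vector in $\mathbb{R}^n$ whose components $\zeta_i$ are independent
and satisfy the Bernstein type conditions: for all $|\lambda| < c^{-1}$

$$\log E e^{\lambda\zeta_i} \le \frac{\lambda^2\sigma^2}{1 - c|\lambda|}. \tag{A.16}$$

Denote $\xi = \zeta/(2\sigma)$ and consider $\|\gamma\|_\circ = \|\gamma\|_\infty$. Fix $g_\circ = \sigma/c$. If $\|\gamma\|_\circ \le g_\circ$,
then $1 - c\gamma_i/(2\sigma) \ge 1/2$ and

$$\log E \exp(\gamma^\top \xi) \le \sum_i \log E \exp\Bigl(\frac{\gamma_i\zeta_i}{2\sigma}\Bigr) \le \sum_i \frac{|\gamma_i/(2\sigma)|^2\sigma^2}{1 - c\gamma_i/(2\sigma)} \le \|\gamma\|^2/2.$$
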




Let also $S$ be some linear subspace of $\mathbb{R}^n$ with dimension $p$ and let $\Pi_S$ denote the
projector on $S$. For applying the result of Theorem A.6.1, the value $r_*$ has to be
fixed. We use that the infinity norm $\|\varepsilon\|_\infty$ concentrates around $\sqrt{2\log p}$.


Lemma A.7.1. It holds for a standard normal vector $\varepsilon \in \mathbb{R}^p$ with $r_* = \sqrt{2\log p}$

$$P(\|\varepsilon\|_\circ \le r_*) \ge 1/2.$$

Indeed

$$P(\|\varepsilon\|_\circ > r_*) \le P\bigl(\|\varepsilon\|_\infty > \sqrt{2\log p}\bigr) \le p\,P\bigl(|\varepsilon_1| > \sqrt{2\log p}\bigr) \le 1/2.$$
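
The last inequality in the proof of Lemma A.7.1 can be checked numerically using $P(|\varepsilon_1| > r) = \operatorname{erfc}(r/\sqrt{2})$. The sketch below evaluates the union bound $p\,P(|\varepsilon_1| > \sqrt{2\log p})$; the check starts at $p = 2$, because for $p = 1$ the radius $\sqrt{2\log p}$ is zero and the union bound equals one. The function name is an illustrative choice.

```python
import math


def union_bound(p):
    """p * P(|eps_1| > sqrt(2 log p)) for a standard normal eps_1."""
    r_star = math.sqrt(2.0 * math.log(p))
    return p * math.erfc(r_star / math.sqrt(2.0))


if __name__ == "__main__":
    for p in (2, 3, 10, 100, 10000):
        print(p, union_bound(p))
```

The values stay below $1/2$ and decrease slowly in $p$, which is what the lemma needs.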


Now the general bound of Theorem A.6.1 is applied to bounding the norm
$\|\Pi_S\zeta\|$. For simplicity of formulation we assume that $g_\circ \ge u_\circ + r_*$.


Theorem A.7.1. Let $S$ be some linear subspace of $\mathbb{R}^n$ with dimension $p$. Let
$g_\circ \ge u_\circ + r_*$. If the coordinates $\zeta_i$ of $\zeta$ are independent and satisfy (A.16), then
for all $x$,

$$P\bigl((4\sigma^2)^{-1}\|\Pi_S\zeta\|^2 > p + \sqrt{\varkappa x p} \vee (\varkappa x), \ \|\Pi_S\zeta\|_\infty \le 2\sigma u_\circ\bigr) \le 2\exp(-x).$$


The bound of Baraud (2010) reads

$$P\bigl(\|\Pi_S\zeta\| > (3\sigma \vee \sqrt{6cu})\sqrt{x + 3p}, \ \|\Pi_S\zeta\|_\infty \le 2\sigma u_\circ\bigr) \le e^{-x}.$$


As expected, in the region $x \le x_c$ of Gaussian approximation, the bound of Baraud
is not sharp and actually quite rough.


A.8 Proofs 


Proof of Theorem A.2.1 


The proof utilizes the following well-known fact, which can be obtained by
straightforward calculus: for $\mu < 1$

$$\log E \exp(\mu\|\xi\|^2/2) = -0.5\,p\log(1 - \mu).$$


Now consider any $u > 0$. By the exponential Chebyshev inequality

$$P(\|\xi\|^2 > p + u) \le \exp\{-\mu(p + u)/2\}\,E \exp(\mu\|\xi\|^2/2)
= \exp\{-\mu(p + u)/2 - (p/2)\log(1 - \mu)\}. \tag{A.17}$$

It is easy to see that the value $\mu = u/(u + p)$ maximizes $\mu(p + u) + p\log(1 - \mu)$
w.r.t. $\mu$, yielding

$$\mu(p + u) + p\log(1 - \mu) = u - p\log(1 + u/p).$$

Further we use that $t - \log(1 + t) \ge a_0 t^2$ for $t \le 1$ and $t - \log(1 + t) \ge a_0 t$ for
$t > 1$ with $a_0 = 1 - \log(2) \ge 0.3$. This implies with $t = u/p$ for $u = \sqrt{\varkappa x p}$
or $u = \varkappa x$ and $\varkappa = 2/a_0 < 6.6$ that

$$P\bigl(\|\xi\|^2 \ge p + \sqrt{\varkappa x p} \vee (\varkappa x)\bigr) \le \exp(-x)$$

as required.
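
The two elementary inequalities used at the end of this proof, and the resulting value of the constant, can be verified on a grid. The following is a small numerical sanity check, not a proof; the grid ranges and names are choices made here.

```python
import math

A0 = 1.0 - math.log(2.0)   # a_0 = 1 - log 2
KAPPA_MIN = 2.0 / A0       # smallest kappa for which the argument works


def phi(t):
    return t - math.log1p(t)


def check_grid(n=1000):
    """Check phi(t) >= a0*t^2 on (0, 1] and phi(s) >= a0*s on (1, 11]."""
    ok = True
    for j in range(1, n + 1):
        t = j / n                # t in (0, 1]
        s = 1.0 + 10.0 * j / n   # s in (1, 11]
        ok = ok and phi(t) >= A0 * t * t - 1e-12
        ok = ok and phi(s) >= A0 * s - 1e-12
    return ok


if __name__ == "__main__":
    print(A0, KAPPA_MIN, check_grid())
```

Both inequalities become equalities at $t = 1$, which is why a small floating-point tolerance is used; the value $2/a_0 \approx 6.52$ explains the choice $\varkappa = 6.6$.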


Proof of Theorem A.2.2

The matrix $B^2$ can be represented as $U^\top \operatorname{diag}(a_1, \ldots, a_p)\,U$ for an orthogonal
matrix $U$. The vector $\tilde\xi = U\xi$ is also standard normal and $\|B\xi\|^2 =
\tilde\xi^\top U B^2 U^\top \tilde\xi$. This means that one can reduce the situation to the case of a diagonal
matrix $B^2 = \operatorname{diag}(a_1, \ldots, a_p)$. We can also assume without loss of generality that
$a_1 \ge a_2 \ge \ldots \ge a_p$. The expressions for the quantities $p$ and $v^2$ simplify to

$$p = \operatorname{tr}(B^2) = a_1 + \ldots + a_p,$$

$$v^2 = 2\operatorname{tr}(B^4) = 2(a_1^2 + \ldots + a_p^2).$$

Moreover, rescaling the matrix $B^2$ by $a_1$ reduces the situation to the case with
$a_1 = 1$.


Lemma A.8.1. It holds

$$E\|B\xi\|^2 = \operatorname{tr}(B^2), \qquad \operatorname{Var}(\|B\xi\|^2) = 2\operatorname{tr}(B^4).$$

Moreover, for $\mu < 1$

$$E \exp\{\mu\|B\xi\|^2/2\} = \det(I - \mu B^2)^{-1/2} = \prod_{i=1}^{p}(1 - \mu a_i)^{-1/2}. \tag{A.18}$$

Proof. If $B^2$ is diagonal, then $\|B\xi\|^2 = \sum_i a_i\xi_i^2$ and the summands $a_i\xi_i^2$ are
independent. It remains to note that $E(a_i\xi_i^2) = a_i$, $\operatorname{Var}(a_i\xi_i^2) = 2a_i^2$, and for
$\mu a_i < 1$,

$$E \exp\{\mu a_i\xi_i^2/2\} = (1 - \mu a_i)^{-1/2},$$

yielding (A.18).

Given $u$, fix $\mu < 1$. The exponential Markov inequality yields

$$P(\|B\xi\|^2 > p + u) \le \exp\Bigl\{-\frac{\mu(p + u)}{2}\Bigr\}\,E \exp\Bigl(\frac{\mu\|B\xi\|^2}{2}\Bigr)
\le \exp\Bigl\{-\frac{\mu u}{2} - \frac{1}{2}\sum_{i=1}^{p}\bigl[\mu a_i + \log(1 - \mu a_i)\bigr]\Bigr\}.$$

We start with the case when $x^{1/2} \le v/3$. Then $u = 2x^{1/2}v$ fulfills $u \le 2v^2/3$.
Define $\mu = u/v^2 \le 2/3$ and use that $t + \log(1 - t) \ge -t^2$ for $t \le 2/3$. This
implies

$$P(\|B\xi\|^2 > p + u) \le \exp\Bigl\{-\frac{\mu u}{2} + \frac{1}{2}\sum_{i=1}^{p}\mu^2 a_i^2\Bigr\} = \exp\bigl(-u^2/(4v^2)\bigr) = e^{-x}.$$

Next, let $x^{1/2} > v/3$. Set $\mu = 2/3$. It holds similarly to the above

$$\sum_{i=1}^{p}\bigl[\mu a_i + \log(1 - \mu a_i)\bigr] \ge -\sum_{i=1}^{p}\mu^2 a_i^2 = -2v^2/9 \ge -2x. \tag{A.19}$$

Now, for $u = 6x$ and $\mu u/2 = 2x$, (A.19) implies

$$P(\|B\xi\|^2 > p + u) \le \exp\{-(2x - x)\} = \exp(-x)$$

as required.


Proof of Theorem A.3.1


The main step of the proof is the following exponential bound.


Lemma A.8.2. Suppose (A.1). For any $\mu < 1$ with $g^2 > p\mu$, it holds

$$E \exp\Bigl(\frac{\mu\|\xi\|^2}{2}\Bigr)\,\mathbb{1}\bigl(\|\xi\| \le g/\mu - \sqrt{p/\mu}\bigr) \le 2(1 - \mu)^{-p/2}. \tag{A.20}$$


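Proof. Let $\varepsilon$ be a standard normal vector in $\mathbb{R}^p$ and $u \in \mathbb{R}^p$. The bound
$P(\|\varepsilon\|^2 > p) \le 1/2$ and the triangle inequality imply for any vector $u$ and any
$r$ with $r \ge \|u\| + p^{1/2}$ that $P(\|u + \varepsilon\| \le r) \ge 1/2$. Let us fix some $\xi$ with
$\|\xi\| \le g/\mu - \sqrt{p/\mu}$ and denote by $P_\xi$ the conditional probability given $\xi$. The
previous arguments yield:

$$P_\xi\bigl(\|\varepsilon + \mu^{1/2}\xi\| \le \mu^{-1/2}g\bigr) \ge 0.5.$$

It holds with $c_p = (2\pi)^{-p/2}$

$$c_p\int \exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma$$
$$= c_p\exp(\mu\|\xi\|^2/2)\int \exp\Bigl(-\frac{1}{2}\bigl\|\mu^{1/2}\xi - \mu^{-1/2}\gamma\bigr\|^2\Bigr)\mathbb{1}\bigl(\mu^{-1/2}\|\gamma\| \le \mu^{-1/2}g\bigr)\,d\gamma$$
$$= \mu^{p/2}\exp(\mu\|\xi\|^2/2)\,P_\xi\bigl(\|\varepsilon + \mu^{1/2}\xi\| \le \mu^{-1/2}g\bigr)$$
$$\ge 0.5\,\mu^{p/2}\exp(\mu\|\xi\|^2/2),$$

because $\|\mu^{1/2}\xi\| + p^{1/2} \le \mu^{-1/2}g$. This implies in view of $p < g^2/\mu$ that

$$\exp(\mu\|\xi\|^2/2)\,\mathbb{1}\bigl(\|\xi\| \le g/\mu - \sqrt{p/\mu}\bigr)
\le 2\mu^{-p/2}c_p\int \exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma.$$

Further, by (A.1)

$$c_p\,E\int \exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma
\le c_p\int \exp\Bigl(-\frac{\mu^{-1} - 1}{2}\|\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma$$
$$\le c_p\int \exp\Bigl(-\frac{\mu^{-1} - 1}{2}\|\gamma\|^2\Bigr)\,d\gamma
\le (\mu^{-1} - 1)^{-p/2}$$

and (A.20) follows.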


Due to this result, the scaled squared norm $\mu\|\xi\|^2/2$ after a proper truncation
possesses the same exponential moments as in the Gaussian case. A straightforward
implication is the probability bound $P(\|\xi\|^2 > p + u)$ for moderate values $u$.
Namely, given $u > 0$, define $\mu = u/(u + p)$. This value optimizes the inequality
(A.17) in the Gaussian case. Now we can apply a similar bound under the constraint
$\|\xi\| \le g/\mu - \sqrt{p/\mu}$. Therefore, the bound is only meaningful if $\sqrt{u + p} \le
g/\mu - \sqrt{p/\mu}$ with $\mu = u/(u + p)$, or, with $w = \sqrt{u/p} \le w_c$; see (A.3).

The largest value $u$ for which this constraint is still valid is given by $p + u = y_c^2$.
Hence, (A.20) yields for $p + u \le y_c^2$

$$P\bigl(\|\xi\|^2 > p + u, \ \|\xi\| \le y_c\bigr)
\le \exp\Bigl\{-\frac{\mu(p + u)}{2}\Bigr\}\,E \exp\Bigl(\frac{\mu\|\xi\|^2}{2}\Bigr)\mathbb{1}\bigl(\|\xi\| \le g/\mu - \sqrt{p/\mu}\bigr)$$
$$\le 2\exp\{-0.5[\mu(p + u) + p\log(1 - \mu)]\}
= 2\exp\{-0.5[u - p\log(1 + u/p)]\}.$$

Similarly to the Gaussian case, this implies with $\varkappa = 6.6$ that

$$P\bigl(\|\xi\|^2 \ge p + \sqrt{\varkappa x p} \vee (\varkappa x), \ \|\xi\| \le y_c\bigr) \le 2\exp(-x).$$

The Gaussian case means that (A.1) holds with $g = \infty$, yielding $y_c = \infty$. In the
non-Gaussian case with a finite $g$, we have to accompany the moderate deviation
bound with a large deviation bound $P(\|\xi\| > y)$ for $y \ge y_c$. This is done by
combining the bound (A.20) with standard slicing arguments.


Lemma A.8.3. Let $\mu_0 \le g^2/p$. Define $y_0 = g/\mu_0 - \sqrt{p/\mu_0}$ and $g_0 = \mu_0 y_0 =
g - \sqrt{\mu_0 p}$. It holds for $y \ge y_0$

$$P(\|\xi\| > y) \le 8.4\,(1 - g_0/y)^{-p/2}\exp(-g_0 y/2) \tag{A.21}$$
$$\le 8.4\exp\{-x_0 - g_0(y - y_0)/2\}, \tag{A.22}$$

with $x_0$ defined by


$$2x_0 = \mu_0 y_0^2 + p\log(1 - \mu_0) = g_0^2/\mu_0 + p\log(1 - \mu_0).$$


Proof. Consider the growing sequence $y_k$ with $y_1 = y$ and $g_0 y_{k+1} = g_0 y + k$.
Define also $\mu_k = g_0/y_k$. In particular, $\mu_k \le \mu_1 = g_0/y$. Obviously

$$P(\|\xi\| > y) = \sum_{k=1}^{\infty} P\bigl(\|\xi\| > y_k, \ \|\xi\| \le y_{k+1}\bigr).$$

Now we try to evaluate every slicing probability in this expression. We use that

$$\mu_{k+1}y_k^2 = \frac{(g_0 y + k - 1)^2}{g_0 y + k} \ge g_0 y + k - 2,$$

and also $g/\mu_k - \sqrt{p/\mu_k} \ge y_k$ because $g - g_0 = \sqrt{\mu_0 p} \ge \sqrt{\mu_k p}$ and

$$g/\mu_k - \sqrt{p/\mu_k} - y_k = \mu_k^{-1}\bigl(g - \sqrt{\mu_k p} - g_0\bigr) \ge 0.$$

Hence by (A.20)

$$P(\|\xi\| > y) = \sum_{k=1}^{\infty} P\bigl(\|\xi\| > y_k, \ \|\xi\| \le y_{k+1}\bigr)$$
$$\le \sum_{k=1}^{\infty} \exp\Bigl(-\frac{\mu_{k+1}y_k^2}{2}\Bigr)\,E \exp\Bigl(\frac{\mu_{k+1}\|\xi\|^2}{2}\Bigr)\mathbb{1}\bigl(\|\xi\| \le y_{k+1}\bigr)$$
$$\le \sum_{k=1}^{\infty} 2(1 - \mu_{k+1})^{-p/2}\exp\Bigl(-\frac{\mu_{k+1}y_k^2}{2}\Bigr)$$
$$\le 2(1 - \mu_1)^{-p/2}\sum_{k=1}^{\infty}\exp\Bigl(-\frac{g_0 y + k - 2}{2}\Bigr)$$
$$= 2e^{1/2}(1 - e^{-1/2})^{-1}(1 - \mu_1)^{-p/2}\exp(-g_0 y/2)$$
$$\le 8.4\,(1 - \mu_1)^{-p/2}\exp(-g_0 y/2)$$

and the first assertion follows. For $y = y_0$, it holds

$$g_0 y_0 + p\log(1 - \mu_0) = \mu_0 y_0^2 + p\log(1 - \mu_0) = 2x_0$$

and (A.21) implies $P(\|\xi\| > y_0) \le 8.4\exp(-x_0)$. Now observe that the function
$f(y) = g_0 y/2 + (p/2)\log(1 - g_0/y)$ fulfills $f(y_0) = x_0$ and $f'(y) \ge g_0/2$,
yielding $f(y) \ge x_0 + g_0(y - y_0)/2$. This implies (A.22).
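
The numerical constant 8.4 in the slicing argument comes from the geometric series $2\sum_{k \ge 1} e^{-(k-2)/2} = 2e^{1/2}/(1 - e^{-1/2})$. A short check of this constant and of the slice inequality $\mu_{k+1}y_k^2 \ge g_0 y + k - 2$ is sketched below; the variable and function names are illustrative only.

```python
import math

# Closed form of 2 * sum_{k>=1} exp(-(k - 2) / 2).
SLICING_CONST = 2.0 * math.exp(0.5) / (1.0 - math.exp(-0.5))


def slice_gap(g0y, k):
    """(g0*y + k - 1)^2 / (g0*y + k) - (g0*y + k - 2); equals 1/(g0*y + k) > 0."""
    a = g0y + k
    return (a - 1.0) ** 2 / a - (a - 2.0)


def partial_sum(n_terms):
    """Partial sum of the series 2 * sum_{k=1}^{n} exp(-(k - 2) / 2)."""
    return 2.0 * sum(math.exp(-(k - 2) / 2.0) for k in range(1, n_terms + 1))


if __name__ == "__main__":
    print(SLICING_CONST, partial_sum(100))
```

The closed form evaluates to about 8.38, so rounding up to 8.4 is safe.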


The statements of the theorem are obtained by applying the lemmas with $\mu_0 =
\mu_c = w_c^2/(1 + w_c^2)$. This also implies $y_0 = y_c$, $x_0 = x_c$, and $g_0 = g_c =
g - \sqrt{\mu_c p}$; cf. (A.4).


Proof of Theorem A.4.1 


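The main steps of the proof are similar to the proof of Theorem A.3.1.


Lemma A.8.4. Suppose (A.1). For any $\mu < 1$ with $g^2/\mu \ge p$, it holds

$$E \exp(\mu\|B\xi\|^2/2)\,\mathbb{1}\bigl(\|B^2\xi\| \le g/\mu - \sqrt{p/\mu}\bigr) \le 2\det(I_p - \mu B^2)^{-1/2}. \tag{A.23}$$


Proof. With $c_p(B) = (2\pi)^{-p/2}\det(B^{-1})$

$$c_p(B)\int \exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|B^{-1}\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma$$
$$= c_p(B)\exp\Bigl(\frac{\mu\|B\xi\|^2}{2}\Bigr)\int \exp\Bigl(-\frac{1}{2}\bigl\|\mu^{1/2}B\xi - \mu^{-1/2}B^{-1}\gamma\bigr\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma$$
$$= \mu^{p/2}\exp\Bigl(\frac{\mu\|B\xi\|^2}{2}\Bigr)\,P_\xi\bigl(\|\mu^{-1/2}B\varepsilon + B^2\xi\| \le g/\mu\bigr),$$

where $\varepsilon$ denotes a standard normal vector in $\mathbb{R}^p$ and $P_\xi$ means the conditional
probability given $\xi$. Moreover, for any $u \in \mathbb{R}^p$ and $r \ge p^{1/2} + \|u\|$, it holds in
view of $P(\|B\varepsilon\|^2 > p) \le 1/2$

$$P(\|B\varepsilon - u\| \le r) \ge P(\|B\varepsilon\| \le \sqrt{p}) \ge 1/2.$$

This implies

$$\exp(\mu\|B\xi\|^2/2)\,\mathbb{1}\bigl(\|B^2\xi\| \le g/\mu - \sqrt{p/\mu}\bigr)
\le 2\mu^{-p/2}c_p(B)\int \exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|B^{-1}\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma.$$

Further, by (A.1)

$$c_p(B)\,E\int \exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|B^{-1}\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma
\le c_p(B)\int \exp\Bigl(\frac{\|\gamma\|^2}{2} - \frac{1}{2\mu}\|B^{-1}\gamma\|^2\Bigr)\,d\gamma$$
$$\le \det(B^{-1})\det(\mu^{-1}B^{-2} - I_p)^{-1/2} = \mu^{p/2}\det(I_p - \mu B^2)^{-1/2}$$

and (A.23) follows.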


Now we evaluate the probability $P(\|B\xi\| > y)$ for moderate values of $y$.


Lemma A.8.5. Let $\mu_0 < 1 \wedge (g^2/p)$. With $y_0 = g/\mu_0 - \sqrt{p/\mu_0}$, it holds for any
$u > 0$

$$P\bigl(\|B\xi\|^2 > p + u, \ \|B^2\xi\| \le y_0\bigr)
\le 2\exp\{-0.5\mu_0(p + u) - 0.5\log\det(I_p - \mu_0 B^2)\}. \tag{A.24}$$

In particular, if $B^2$ is diagonal, that is, $B^2 = \operatorname{diag}(a_1, \ldots, a_p)$, then

$$P\bigl(\|B\xi\|^2 > p + u, \ \|B^2\xi\| \le y_0\bigr)
\le 2\exp\Bigl\{-\frac{\mu_0 u}{2} - \frac{1}{2}\sum_{i=1}^{p}\bigl[\mu_0 a_i + \log(1 - \mu_0 a_i)\bigr]\Bigr\}. \tag{A.25}$$


Proof. The exponential Chebyshev inequality and (A.23) imply

$$P\bigl(\|B\xi\|^2 > p + u, \ \|B^2\xi\| \le y_0\bigr)
\le \exp\Bigl\{-\frac{\mu_0(p + u)}{2}\Bigr\}\,E \exp\Bigl(\frac{\mu_0\|B\xi\|^2}{2}\Bigr)\mathbb{1}\bigl(\|B^2\xi\| \le g/\mu_0 - \sqrt{p/\mu_0}\bigr)$$
$$\le 2\exp\{-0.5\mu_0(p + u) - 0.5\log\det(I_p - \mu_0 B^2)\}.$$

Moreover, the standard change-of-basis arguments allow us to reduce the problem
to the case of a diagonal matrix $B^2 = \operatorname{diag}(a_1, \ldots, a_p)$ where $1 = a_1 \ge a_2 \ge
\ldots \ge a_p > 0$. Note that $p = a_1 + \ldots + a_p$. Then the claim (A.24) can be written
in the form (A.25).


Now we evaluate a large deviation probability that $\|B\xi\| > y$ for a large $y$.
Note that the condition $\|B^2\|_\infty \le 1$ implies $\|B^2\xi\| \le \|B\xi\|$. So, the bound
(A.24) continues to hold when $\|B^2\xi\| \le y_0$ is replaced by $\|B\xi\| \le y_0$.


Lemma A.8.6. Let $\mu_0 < 1$ and $\mu_0 p < g^2$. Define $g_0$ by $g_0 = g - \sqrt{\mu_0 p}$. For
any $y \ge y_0 = g_0/\mu_0$, it holds

$$P(\|B\xi\| > y) \le 8.4\det\{I_p - (g_0/y)B^2\}^{-1/2}\exp(-g_0 y/2)
\le 8.4\exp\bigl(-x_0 - g_0(y - y_0)/2\bigr), \tag{A.26}$$

where $x_0$ is defined by

$$2x_0 = g_0 y_0 + \log\det\{I_p - (g_0/y_0)B^2\}.$$


Proof. The slicing arguments of Lemma A.8.3 apply here in the same manner. One
has to replace $\|\xi\|$ by $\|B\xi\|$ and $(1 - \mu_1)^{-p/2}$ by $\det\{I_p - (g_0/y)B^2\}^{-1/2}$. We
omit the details. In particular, with $y = y_0 = g_0/\mu_0$, this yields

$$P(\|B\xi\| > y_0) \le 8.4\exp(-x_0).$$

Moreover, for the function $f(y) = g_0 y + \log\det\{I_p - (g_0/y)B^2\}$, it holds
$f'(y) \ge g_0$ and hence, $f(y) \ge f(y_0) + g_0(y - y_0)$ for $y > y_0$. This implies
(A.26).

One important feature of the results of Lemma A.8.5 and Lemma A.8.6 is that
the value $\mu_0 < 1 \wedge (g^2/p)$ can be selected arbitrarily. In particular, for $y > y_c$,
Lemma A.8.6 with $\mu_0 = \mu_c$ yields the large deviation probability $P(\|B\xi\| >
y)$. For bounding the probability $P(\|B\xi\|^2 > p + u, \|B\xi\| \le y_c)$, we use the
inequality $\log(1 - t) \ge -t - t^2$ for $t \le 2/3$. By (A.24), it implies for $\mu \le 2/3$ that

$$-2\log\Bigl\{\tfrac{1}{2}P\bigl(\|B\xi\|^2 > p + u, \ \|B\xi\| \le y_c\bigr)\Bigr\}
\ge \mu(p + u) + \sum_{i=1}^{p}\log(1 - \mu a_i)$$
$$\ge \mu(p + u) - \sum_{i=1}^{p}(\mu a_i + \mu^2 a_i^2) \ge \mu u - \mu^2 v^2/2. \tag{A.27}$$

Now we distinguish between $\mu_c = 2/3$ and $\mu_c < 2/3$, starting with $\mu_c = 2/3$.
The bound (A.27) with $\mu_0 = 2/3$ and with $u = (2 v x^{1/2}) \vee (6x)$ yields

$$P\bigl(\|B\xi\|^2 > p + u, \ \|B\xi\| \le y_c\bigr) \le 2\exp(-x);$$

see the proof of Theorem A.2.2 for the Gaussian case.

Now consider $\mu_c < 2/3$. For $x^{1/2} \le \mu_c v/2$, use $u = 2 v x^{1/2}$ and $\mu_0 = u/v^2$.
It holds $\mu_0 = u/v^2 \le \mu_c$ and $u^2/(4v^2) = x$, yielding the desired bound by (A.27).
For $x^{1/2} > \mu_c v/2$, we select again $\mu_0 = \mu_c$. It holds with $u = 4\mu_c^{-1}x$ that
$\mu_c u/2 - \mu_c^2 v^2/4 \ge 2x - x = x$. This completes the proof.


Proof of Theorem A.6.1 


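The arguments behind the result are the same as in the one-norm case of
Theorem A.3.1. We only outline the main steps.


Lemma A.8.7. Suppose (A.10) and (A.11). For any $\mu < 1$ with $g_\circ > \mu^{1/2}r_*$, it
holds

$$E \exp(\mu\|\xi\|^2/2)\,\mathbb{1}\bigl(\|\xi\|_\circ \le g_\circ/\mu - r_*/\mu^{1/2}\bigr) \le 2(1 - \mu)^{-p/2}. \tag{A.28}$$

Proof. Let $\varepsilon$ be a standard normal vector in $\mathbb{R}^p$ and $u \in \mathbb{R}^p$. Let us fix some $\xi$
with $\mu^{1/2}\|\xi\|_\circ \le \mu^{-1/2}g_\circ - r_*$ and denote by $P_\xi$ the conditional probability given
$\xi$. It holds by (A.11) with $c_p = (2\pi)^{-p/2}$

$$c_p\int \exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\|_\circ \le g_\circ)\,d\gamma$$
$$= c_p\exp(\mu\|\xi\|^2/2)\int \exp\Bigl(-\frac{1}{2}\bigl\|\mu^{1/2}\xi - \mu^{-1/2}\gamma\bigr\|^2\Bigr)\mathbb{1}\bigl(\mu^{-1/2}\|\gamma\|_\circ \le \mu^{-1/2}g_\circ\bigr)\,d\gamma$$
$$= \mu^{p/2}\exp(\mu\|\xi\|^2/2)\,P_\xi\bigl(\|\varepsilon - \mu^{1/2}\xi\|_\circ \le \mu^{-1/2}g_\circ\bigr)$$
$$\ge 0.5\,\mu^{p/2}\exp(\mu\|\xi\|^2/2).$$

This implies

$$\exp\Bigl(\frac{\mu\|\xi\|^2}{2}\Bigr)\mathbb{1}\bigl(\|\xi\|_\circ \le g_\circ/\mu - r_*/\mu^{1/2}\bigr)
\le 2\mu^{-p/2}c_p\int \exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\|_\circ \le g_\circ)\,d\gamma.$$

Further, by (A.10)

$$c_p\,E\int \exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\|_\circ \le g_\circ)\,d\gamma
\le c_p\int \exp\Bigl(-\frac{\mu^{-1} - 1}{2}\|\gamma\|^2\Bigr)\,d\gamma \le (\mu^{-1} - 1)^{-p/2}$$

and (A.28) follows.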


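As in the Gaussian case, (A.28) implies for $\mathfrak{z} > p$ with $\mu = \mu(\mathfrak{z}) = (\mathfrak{z} -
p)/\mathfrak{z}$ the bounds (A.13) and (A.14). Note that the value $\mu(\mathfrak{z})$ clearly grows with
$\mathfrak{z}$ from zero to one, while $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z})$ is strictly decreasing. The value
$\mathfrak{z}_\circ$ is defined exactly as the point where $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z})$ crosses $u_\circ$, so that
$g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z}) \ge u_\circ$ for $\mathfrak{z} \le \mathfrak{z}_\circ$.

For $\mathfrak{z} > \mathfrak{z}_\circ$, the choice $\mu = \mu(\mathfrak{z})$ conflicts with $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z}) \ge u_\circ$.
So, we apply $\mu = \mu_\circ$, yielding by the Markov inequality

$$P\bigl(\|\xi\|^2 > \mathfrak{z}, \ \|\xi\|_\circ \le u_\circ\bigr) \le 2\exp\{-\mu_\circ\mathfrak{z}/2 - (p/2)\log(1 - \mu_\circ)\},$$

and the assertion follows.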


Proof of Theorem A.6.2 


Arguments from the proof of Lemmas A.8.4 and A.8.7 yield in view of
$g_\circ\mu_\circ^{-1} - r_*\mu_\circ^{-1/2} \ge u_\circ$

$$E \exp\{\mu_\circ\|\Pi\xi\|^2/2\}\,\mathbb{1}\bigl(\|\Pi^2\xi\|_\circ \le u_\circ\bigr)
\le E \exp(\mu_\circ\|\Pi\xi\|^2/2)\,\mathbb{1}\bigl(\|\Pi^2\xi\|_\circ \le g_\circ/\mu_\circ - r_*/\mu_\circ^{1/2}\bigr)$$
$$\le 2\det(I_p - \mu_\circ\Pi^2)^{-1/2}.$$

The inequality $\log(1 - t) \ge -t - t^2$ for $t \le 2/3$ and the symmetry of the matrix
$\Pi$ imply

$$-\log\det(I_p - \mu_\circ\Pi^2) \le \mu_\circ p + \mu_\circ^2 v^2/2;$$

cf. (A.27); the assertion (A.15) follows.